feat(fanout): migrate SlackNotifyFn to FanOutConsumer subscriber (#64) #79
Conversation
…-samples#64) Move the Slack outbound delivery off its own DynamoDB Streams consumer onto FanOutConsumer as a per-channel dispatcher. Drops TaskEventsTable from 2 concurrent stream readers to 1, restoring headroom for future channels (Email, Teams, etc.) without exceeding the documented DynamoDB Streams 2-reader-per-shard practical limit. The PR also addresses an adversarial code review on the original migration; the body below walks through each piece in the order it landed.

## (a) Migration

- `cdk/src/handlers/slack-notify.ts` — rewritten as exported `dispatchSlackEvent(event, ddb)` plus a tagged `SlackApiError` class. The standalone `handler(event)` stream entrypoint is gone; the FanOutConsumer is now the only TaskEventsTable stream reader. Behaviour preserved bit-for-bit: `channel_source === 'slack'` gate, terminal-event dedup via conditional UpdateItem on `channel_metadata.slack_notified_terminal`, threaded replies under the @mention or task_created message, emoji transitions (eyes -> hourglass -> ✅/❌/🚫/⏲), DM channel_id -> user_id rewrite, intermediate session+created message cleanup on terminal events.
- `cdk/src/handlers/fanout-task-events.ts` — replaces the log-only `dispatchToSlack` stub with a wrapper that calls `dispatchSlackEvent` and routes errors via the new typed contract (see (b) below). Slack defaults gain task_created, session_started, task_timed_out so the router fans out the lifecycle events the old SlackNotifyFn handled; the dispatcher's channel_source gate keeps non-Slack tasks unaffected.
- `cdk/src/constructs/fanout-consumer.ts` — adds a scoped `secretsmanager:GetSecretValue` grant on `bgagent/slack/*` so the fanout Lambda can fetch per-workspace bot tokens. Same scope the old SlackNotifyFn role held.
- `cdk/src/constructs/slack-integration.ts` — deletes SlackNotifyFn, its DynamoEventSource, its IAM policy, and its NagSuppressions entry. Drops the now-unused StartingPosition / FilterCriteria / FilterRule / lambdaEventSources imports.
After this lands, `aws lambda list-event-source-mappings` shows exactly one consumer of the TaskEventsTable stream (FanOutFn); verified on the dev stack with end-to-end @mention + cancel + CLI isolation scenarios.

## (b) Review fix #1 — partial-batch retry semantics (BLOCKER)

The first review pass found that the post-migration handler silently dropped Slack-side infra errors (DDB throttle on the GetItem, Secrets Manager 5xx, transient Slack timeout). Pre-migration the SlackNotifyFn handler rethrew non-SlackApiError so Lambda retried the batch; post-migration `Promise.allSettled` swallowed the rejection and routeEvent returned an empty list with no escalation path to `batchItemFailures`.

routeEvent's return type changed from `NotificationChannel[]` to `{ dispatched, infraRejections }`. The handler now pushes the record into `batchItemFailures` whenever `infraRejections.length > 0`, so Lambda replays the record under the partial-batch contract. The warn line on rejection is tagged `retryable: true` so operators can alert distinctly from the channel-terminal swallow path. GitHub got the symmetric treatment: 4xx (excluding the existing 401 and 404 handling) is now treated as a channel-terminal swallow via `fanout.github.api_error` instead of escalating to retry.

## (c) Review fix aws-samples#2 — split SlackApiError into terminal + retryable

Originally any `!result.ok` Slack response was wrapped in SlackApiError and swallowed. That collapsed retryable codes (`ratelimited`, `service_unavailable`, `internal_error`, `fatal_error`, `request_timeout`) into the same swallow as `channel_not_found` — a tier-1 Slack outage would silently drop every message. Introduced `TERMINAL_SLACK_API_ERRORS` set + `classifySlackError` helper. Terminal codes still throw SlackApiError (router swallows). Retryable codes throw a plain Error so the router classifies them as infra rejections and Lambda replays.
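The partial-batch escalation described in (b) can be sketched as follows. This is a hedged illustration, not the handler's actual code: `NotificationChannel`, `RouteOutcome`, and `toBatchItemFailures` mirror names from the commit text, but the surrounding shapes are assumptions.

```typescript
// Illustrative sketch of the routeEvent → batchItemFailures contract.
type NotificationChannel = 'slack' | 'github' | 'email';

interface RouteOutcome {
  dispatched: NotificationChannel[];
  infraRejections: NotificationChannel[];
}

// Any record with at least one infra-class rejection is escalated so
// Lambda replays it under the partial-batch contract; channel-terminal
// swallows leave infraRejections empty and the record is acknowledged.
function toBatchItemFailures(
  records: Array<{ eventID: string; outcome: RouteOutcome }>,
): Array<{ itemIdentifier: string }> {
  return records
    .filter(({ outcome }) => outcome.infraRejections.length > 0)
    .map(({ eventID }) => ({ itemIdentifier: eventID }));
}
```

The `{ itemIdentifier }` shape matches Lambda's documented partial-batch response for DynamoDB Streams event sources.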
## (d) Review fix aws-samples#3 — NOTIFIABLE_EVENTS / CHANNEL_DEFAULTS drift

The original migration added task_created/session_started/task_timed_out to CHANNEL_DEFAULTS.slack but the dispatcher's NOTIFIABLE_EVENTS gate already excluded several events the router was subscribing Slack to (agent_error, pr_created, task_stranded). Result: Slack was reported as `dispatched` for events it silently dropped — telemetry lied, agent_error never reached operators on Slack-origin tasks, and task_stranded rendered the generic "Event: task_stranded for owner/repo" fallback (UX regression).

Added render cases for task_stranded and agent_error in slack-blocks.ts and added them to NOTIFIABLE_EVENTS. Forward-compat approval_required and status_response stay out of NOTIFIABLE_EVENTS until their emitters ship; a new cross-file consistency test in fanout-task-events.test.ts fails if anyone re-introduces the drift. The Slack dispatcher wrapper now passes `effectiveEventType` so an agent_milestone(pr_created) wrapper is unwrapped before NOTIFIABLE_EVENTS matching. Without the rewrite, the dispatcher would short-circuit on the wrapper string `agent_milestone`.

## (e) Review fix aws-samples#4 — conditional UpdateItem on lifecycle persists

Once the BLOCKER fix made batches retry, the original task_created and session_started UpdateItem calls became hazardous: a Slack POST that succeeded but whose follow-up UpdateItem failed transiently would, on retry, post a second root and overwrite slack_thread_ts — orphaning every threaded reply that had threaded under the first ts.

Both UpdateItems now carry an `attribute_not_exists` ConditionExpression on the relevant `channel_metadata.slack_*_msg_ts`. On ConditionalCheckFailedException the handler logs at info, deletes the duplicate Slack message via `chat.delete`, and returns. Sibling retry wins the race; the duplicate is cleaned up.
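The conditional persist in (e) can be sketched as a parameter builder. This is a hedged sketch: the table/attribute names come from the commit text, but the key schema and helper name are illustrative assumptions, and the real code presumably builds these params inline for the AWS SDK `UpdateItem` call.

```typescript
// Illustrative DynamoDB UpdateItem params for the first-writer-wins
// lifecycle persist. Key schema (task_id) is an assumption.
function lifecyclePersistParams(tableName: string, taskId: string, ts: string) {
  return {
    TableName: tableName,
    Key: { task_id: { S: taskId } },
    UpdateExpression: 'SET channel_metadata.slack_created_msg_ts = :ts',
    // First writer wins: a retry whose sibling already persisted the root
    // message ts fails this condition (ConditionalCheckFailedException)
    // instead of overwriting slack_thread_ts and orphaning the thread.
    ConditionExpression:
      'attribute_not_exists(channel_metadata.slack_created_msg_ts)',
    ExpressionAttributeValues: { ':ts': { S: ts } },
  };
}
```

On the conditional failure, the losing retry deletes its duplicate Slack post rather than persisting a second root.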
## (f) Dev-stack regression: drop pr_created from Slack defaults

Live verification surfaced a UX duplication: pr_created (subscribed in CHANNEL_DEFAULTS.slack as the original §6.2 design called for) and task_completed both rendered messages with View PR buttons, posted seconds apart. The original SlackNotifyFn had silently dropped pr_created (NOTIFIABLE_EVENTS gate), so users hadn't relied on it. Removed pr_created from CHANNEL_DEFAULTS.slack and from NOTIFIABLE_EVENTS, and removed the prCreatedMessage renderer. GitHub keeps pr_created (its edit-in-place comment surface genuinely benefits from the early checkpoint).

## Verification

- mise //cdk:compile — clean
- mise //cdk:test — 1183 / 1183 pass (8 net-new tests added for the review fixes: NOTIFIABLE_EVENTS drift guard, retryable Slack codes, GitHub 4xx swallow, infra rejection escalation, SlackApiError swallow, task_stranded render)
- mise //cdk:eslint — clean
- mise //cdk:synth — confirms exactly one Lambda::EventSourceMapping on TaskEventsTable, pointing at FanOutFn
- Dev-stack scenarios — @mention happy path, Cancel button, CLI submit (channel_source=api -> zero Slack dispatches, GitHub edit-in-place still fires)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Automated review: This is well-considered work. The partial-batch retry semantics, the terminal/retryable Slack error classification, and the conditional UpdateItem idempotency guards are all the kind of hardening this migration needed. That said, the migration introduces coupling trade-offs and error-path gaps that warrant attention.

CRITICAL (must fix before merge)
File: cdk/src/handlers/fanout-task-events.ts (4xx swallow branch)

The new `else if (httpStatus >= 400 && httpStatus < 500)` swallow path treats ALL GitHub 4xx as channel-terminal. But HTTP 403 from GitHub frequently means rate limit exceeded (a transient condition), and HTTP 429 (Too Many Requests) is also 4xx. Under a reconciliation wave touching many tasks, an entire window of GitHub comment updates would be permanently lost with only a warn log.

Fix: Carve out 403 and 429:
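A minimal sketch of the suggested carve-out. The helper name and the three-way outcome type are hypothetical — the real branch lives inline in the dispatcher's error handling — but the status partitioning matches the finding above.

```typescript
// Hypothetical classification helper illustrating the suggested fix.
type GitHubOutcome = 'retry' | 'swallow' | 'handled';

function classifyGitHubStatus(httpStatus: number): GitHubOutcome {
  // 401 and 404 keep their existing dedicated handling paths.
  if (httpStatus === 401 || httpStatus === 404) return 'handled';
  // 403 (frequently "API rate limit exceeded") and 429 are transient:
  // escalate as infra rejections so Lambda replays the record.
  if (httpStatus === 403 || httpStatus === 429) return 'retry';
  // Remaining 4xx are channel-terminal: swallow with a warn log.
  if (httpStatus >= 400 && httpStatus < 500) return 'swallow';
  // 5xx and anything else: infra-class, retry.
  return 'retry';
}
```

With this split, a reconciliation wave that hits the rate limit replays instead of silently losing the window of comment updates.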
File: cdk/src/constructs/fanout-consumer.ts

Every other external-service grant in FanOutConsumer (taskTable, repoTable, githubTokenSecret) is guarded by `if (props.X)`. The new `bgagent/slack/*` Secrets Manager grant is unconditional — meaning dev deployments without Slack onboarding get a dangling IAM permission. This breaks the construct's documented contract and will trigger a cdk-nag AwsSolutions-IAM5 finding with a misleading suppression reason.

Fix: Add a `slackSecretPrefix?: string` prop, guard the policy statement, and add a dedicated NagSuppression with the correct reason.

HIGH (strongly recommended)
File: cdk/src/handlers/slack-notify.ts — early guard in dispatchSlackEvent When the env var is missing, the function returns (no throw), so the router counts Slack as dispatched. A broken deployment silently drops ALL Slack notifications indefinitely. This should throw an Error (infra-class) so the rejected-rate alarm fires and operators notice.
When Slack returns ratelimited, the record lands in batchItemFailures. On retry, all three dispatchers re-execute — including GitHub and Email that already succeeded. For non-terminal events without conditional-persist guards, this means duplicate GitHub PATCHes and double log entries. Under sustained Slack rate-limiting, every record retries 3× before DLQ, each time re-invoking GitHub unnecessarily. Mitigation options: Per-channel "already-dispatched" bitmap, or document as accepted trade-off at current scale and surface Slack's Retry-After header in the warn log.
File: cdk/src/handlers/slack-notify.ts — addReaction, removeReaction, deleteMessage These functions catch everything (DNS failures, SyntaxError, timeouts) and log at warn. Under the new architecture where infra errors should propagate for retry, these swallows are inconsistent. The user-visible symptom: stale emoji reactions (hourglass remains after completion), orphaned intermediate messages. Recommendation: At minimum, log network/infra failures at error with a dedicated event key so operators can alarm.
When ConditionalCheckFailedException fires (sibling retry won the race), the code calls deleteMessage, which is best-effort. If the delete fails, a permanent duplicate Slack message remains in the thread.

Recommendation: Add a fanout.slack.dup_delete_failed event on the delete failure path.

MEDIUM (address in follow-up)
The fanout handler catches err instanceof SlackApiError. If the bundler ever duplicates the module (rare but possible), instanceof fails silently — terminal errors would trigger infinite retries. Add a property-based fallback:
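A sketch of that property-based fallback. The `SlackApiError` definition here is an illustrative stand-in for the real class in slack-notify.ts; the point is the combined `instanceof` + `name` check.

```typescript
// Illustrative tagged error class mirroring the one in slack-notify.ts.
class SlackApiError extends Error {
  constructor(public readonly code: string) {
    super(`Slack API error: ${code}`);
    this.name = 'SlackApiError';
  }
}

// instanceof covers the normal single-bundle case; the name check covers
// a foreign copy of the class if the module is ever dual-bundled.
function isSlackApiError(err: unknown): boolean {
  return (
    err instanceof SlackApiError ||
    (err instanceof Error && err.name === 'SlackApiError')
  );
}
```

Either shape now takes the swallow branch, so a bundler-duplicated class can no longer flip a channel-terminal error into infinite retries.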
The set omits ekm_access_denied, team_access_not_granted, missing_scope, invalid_arguments, posting_to_general_channel_denied — all permanent failures that will burn retries until DLQ.
readonly dispatched: NotificationChannel[] prevents reassignment but allows outcome.dispatched.push(...). Inconsistent with ReadonlySet used for CHANNEL_DEFAULTS in the same file.
The decoupling is purely nominal — both have identical fields. A type alias (`type SlackDispatchEvent = FanOutEvent`) or `Pick<FanOutEvent, ...>` would prevent silent drift and make the relationship explicit.

Test Coverage Gaps

(table not captured in this export)

Architecture Trade-offs (Accepted)

These are design decisions I'd want documented but not necessarily changed:
What's Done Well
…ws-samples#79 review #1) PR aws-samples#79 review found that the new 4xx terminal-swallow path treats HTTP 403 and 429 as channel-terminal — but on GitHub these are transient rate-limit responses (403 with "API rate limit exceeded", 429 "Too Many Requests"). Under a reconciliation wave that touches many tasks, an entire window of GitHub comment updates would be permanently dropped with only a warn log. Carve out 403 and 429 from the swallow guard so they propagate as infra rejections through ``Promise.allSettled``. The record lands in ``batchItemFailures`` and Lambda replays until the rate-limit window clears (or DLQs after ``retryAttempts``). Test coverage: parametrized over 403 + 429 with a GitHubCommentError mock at the helper boundary, asserting the record's eventID surfaces in ``batchItemFailures`` rather than being absorbed.
review aws-samples#2) Every other external-service grant in FanOutConsumer (taskTable, repoTable, githubTokenSecret) is gated by ``if (props.X)``, so a deployment that hasn't onboarded the corresponding service stays free of dangling IAM permissions. The original migration broke the pattern with an unconditional ``bgagent/slack/*`` Secrets Manager grant — dev stacks without Slack onboarding ended up holding read permission on a resource pattern they never use, with a misleading ``cdk-nag AwsSolutions-IAM5`` suppression reason. Adds an optional ``slackSecretArnPattern`` prop on ``FanOutConsumerProps``; the policy statement is only attached when the prop is set. ``cdk/src/stacks/agent.ts`` now computes the ``bgagent/slack/*`` ARN inline and passes it through, mirroring the other guarded props. ``ArnFormat`` and ``Stack`` imports moved out of fanout-consumer.ts since the construct no longer needs them. No changes to live behaviour — agent.ts always passes the prop, so the IAM policy still attaches in production. The dispatcher will log-and-fail-retry on a missing pattern (covered by review aws-samples#3 fix). Test gap covering the construct itself ships in a follow-up commit (test gap aws-samples#34).
…review aws-samples#3) Pre-fix: when ``TASK_TABLE_NAME`` was unset on a Slack-subscribed event, ``dispatchSlackEvent`` returned silently after a warn line. The router counted Slack as ``dispatched`` and a broken stack quietly dropped every Slack notification — operators only saw it in the warn-rate metric, with no rejected-channel signal. Post-fix: throw a plain Error so the rejection propagates as an infra rejection through ``Promise.allSettled``. The router pushes the record into ``batchItemFailures``, Lambda retries the batch, the ``fanout.dispatcher.rejected`` warn fires per record, and operators get a distinct alarm. Also bumps the existing log line from ``warn`` to ``error`` and attaches an ``error_id: FANOUT_SLACK_MISSING_TASK_TABLE`` so the deployment-bug case can be distinguished from per-record failures. Test: ``throws when TASK_TABLE_NAME env var is missing`` deletes the env var, asserts the throw, asserts no DDB call was attempted (env-var guard fires first).
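The early guard described above can be sketched as a tiny helper. The function name is hypothetical (the real check presumably sits at the top of `dispatchSlackEvent`), but the throw-on-missing behaviour and the `FANOUT_SLACK_MISSING_TASK_TABLE` identifier mirror the commit text.

```typescript
// Hypothetical env-var guard: a plain Error propagates through
// Promise.allSettled as an infra rejection, so the record lands in
// batchItemFailures instead of being silently counted as dispatched.
function requireTaskTable(env: Record<string, string | undefined>): string {
  const name = env.TASK_TABLE_NAME;
  if (!name) {
    throw new Error('FANOUT_SLACK_MISSING_TASK_TABLE');
  }
  return name;
}
```

Because the guard fires before any DDB call, a misdeployed stack now alarms immediately rather than dropping every Slack notification.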
…amples#79 review aws-samples#7) If a bundler ever duplicates the slack-notify module (rare with NodejsFunction tree-shaking but possible if dual-bundled), two distinct SlackApiError classes coexist and ``instanceof`` against one fails for instances of the other. The dispatcher would see a foreign-class SlackApiError, fall through to the rethrow branch, and the router would treat it as an infra rejection — flipping a channel-terminal swallow into infinite Lambda retries. Add an ``err.name === 'SlackApiError'`` fallback so the swallow branch fires either way. Mirrors the duck-typed ``GitHubCommentError`` check used elsewhere in the same handler. Test: synthesise a plain Error with name === 'SlackApiError' (NOT an instance of the mock's SlackApiError class) and assert batchItemFailures stays empty — proving the swallow path catches both shapes.
…ws-samples#79 review aws-samples#8) Original set omitted documented Slack permission/scope failures. Codes outside the set fall to the retryable branch, so a misconfiguration like ``ekm_access_denied`` or ``missing_scope`` would burn 3 Lambda retries before DLQ on every event — even though the failure is fundamentally a configuration bug that no retry can clear.

Adds:
- Permission/scope: missing_scope, ekm_access_denied, team_access_not_granted, posting_to_general_channel_denied
- Payload shape: invalid_arguments

Reorganized the set into commented blocks (channel-shape, auth, permission/scope, payload-shape) so future additions go in the right bucket and the rationale stays visible.

Test coverage: parametrized over the full TERMINAL_SLACK_API_ERRORS set (21 codes) — every one must throw SlackApiError so the router swallows it. The existing retryable test.each remains intact and covers the negative-class case (codes outside the set throw a plain Error and escalate to retry).
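The reorganized set might look like the following. This is a partial sketch — the real set has 21 codes per the commit text; only a sample per block is shown here, and the exact membership beyond the codes named above is an assumption.

```typescript
// Partial sketch of the block-commented terminal set; full set has 21 codes.
const TERMINAL_SLACK_API_ERRORS: ReadonlySet<string> = new Set([
  // channel-shape
  'channel_not_found',
  'is_archived',
  // auth
  'invalid_auth',
  'token_revoked',
  // permission/scope
  'missing_scope',
  'ekm_access_denied',
  'team_access_not_granted',
  'posting_to_general_channel_denied',
  // payload-shape
  'invalid_arguments',
]);

// Terminal codes are swallowed by the router; anything else escalates
// as an infra rejection and Lambda replays the record.
function classifySlackError(code: string): 'terminal' | 'retryable' {
  return TERMINAL_SLACK_API_ERRORS.has(code) ? 'terminal' : 'retryable';
}
```

The membership test keeps the default path retryable, which is the safe direction: an unknown new Slack code gets retried rather than silently dropped.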
…gs (aws-samples#79 review aws-samples#5) The reaction / delete helpers (``addReaction``, ``removeReaction``, ``deleteMessage``) used to log every catch at warn with a single generic event key, lumping API-level rejections (e.g. ``no_reaction``) together with infrastructure failures (DNS lookup, TLS handshake, fetch timeout, JSON parse error from a hostile gateway). Operators who alarmed on the warn rate saw a flat signal that masked genuine infra problems.

Split the boundary:
- API-level (``!result.ok`` after a successful HTTP call) stays at warn with channel-specific event keys (``fanout.slack.reaction_add_api_error``, ``fanout.slack.reaction_remove_api_error``, ``fanout.slack.message_delete_api_error``). These are per-message UX problems; operators don't page.
- Network errors (the outer ``catch (err)`` after ``fetch``) promote to ``logger.error`` with dedicated event keys (``fanout.slack.reaction_add_network_error``, ``fanout.slack.reaction_remove_network_error``, ``fanout.slack.message_delete_network_error``) and ``error_id``s (``FANOUT_SLACK_REACTION_NETWORK``, ``FANOUT_SLACK_DELETE_NETWORK``) so each has its own alarmable signal.

User-visible symptoms when these fire silently: stale emoji reactions (hourglass never swaps to ✅) and orphaned intermediate messages.

Behaviour unchanged: errors are still swallowed (per-message reactions and intermediate cleanup are best-effort by design; they must not fail the batch), but operators now get distinct metrics for each failure class.
…umulation (aws-samples#79 review aws-samples#6) The conditional UpdateItem dup-delete path (``task_created`` / ``session_started`` lifecycle persists) calls ``deleteMessage`` to clean up the duplicate Slack message that landed when a sibling retry won the race. The delete is inherently best-effort — but if it fails, the duplicate becomes a permanent ghost in the thread and operators had no way to alarm on the rate. Refactor ``deleteMessage`` to return a boolean (``true`` on success or ``message_not_found``-as-already-gone, ``false`` otherwise) and emit a dedicated ``fanout.slack.dup_delete_failed`` event with an ``error_id: FANOUT_SLACK_DUP_DELETE_FAILED`` from the dup-delete callsites when the cleanup couldn't complete. The terminal-event cleanup paths (``slack_session_msg_ts``, ``slack_created_msg_ts``) intentionally don't fire this event — those paths target genuinely-stale UX cleanup, not retry-driven duplicates, so an alarm there would be noise. No new tests beyond the existing dup-delete coverage; the ``deleteMessage`` return value isn't yet asserted at the unit level, but the behavior is fully exercised by the existing ``dup-delete`` integration paths (test gap aws-samples#31 will add an explicit failure-path assertion when it lands).
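The boolean contract described above can be sketched like this. The real `deleteMessage` calls Slack's `chat.delete` via fetch; here the HTTP call is injected so the success/already-gone/failure classification is the only thing shown — everything else is an assumption.

```typescript
// Illustrative sketch: true on success OR message_not_found (a prior
// retry already cleaned up), false on any other API error or a thrown
// network failure. Callers emit fanout.slack.dup_delete_failed on false.
async function deleteMessage(
  callChatDelete: () => Promise<{ ok: boolean; error?: string }>,
): Promise<boolean> {
  try {
    const result = await callChatDelete();
    if (result.ok || result.error === 'message_not_found') return true;
    return false;
  } catch {
    // Network-class failure: still best-effort, but report it.
    return false;
  }
}
```

Treating `message_not_found` as success is what prevents the false-positive alarm when the race resolves cleanly.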
…les#79 review aws-samples#9) ``RouteOutcome.dispatched`` and ``infraRejections`` were typed as plain ``NotificationChannel[]`` — which made ``readonly`` on the property prevent reassignment but still allow callers to mutate the underlying array via ``.push``, ``.splice``, or ``.sort``. Inconsistent with the ``ReadonlySet<string>`` used for ``CHANNEL_DEFAULTS`` in the same file. Tightening to ``ReadonlyArray<NotificationChannel>`` makes the contract honest: the router owns the arrays, callers read them. Test suite updated to use ``[...outcome.dispatched].sort()`` where it previously called ``.sort()`` directly — the explicit copy makes the intent clear and would have surfaced any silent test-side mutation.
…aws-samples#79 review aws-samples#10) The two interfaces were structurally identical: same five fields, same readonly modifiers, same metadata shape. The decoupling was purely nominal and a silent-drift footgun — adding a field to ``FanOutEvent`` (e.g. when the router starts plumbing an ``approval_required`` ID through) would not flow into ``SlackDispatchEvent``, leaving the dispatcher unaware until a downstream test happened to fail.

Replace with a one-line type alias: ``export type SlackDispatchEvent = FanOutEvent;``

The slack-notify module now type-imports ``FanOutEvent`` from fanout-task-events. ``import type`` is erased at compile time, so the runtime bundle still has the one-way dep (fanout-task-events → slack-notify) — no module-cycle hazard. Reviewer-suggested ``Pick<FanOutEvent, 'task_id' | …>`` was considered and rejected: the dispatcher uses every field of ``FanOutEvent``, so the Pick would just enumerate the same five fields with extra noise. A direct alias keeps the intent obvious and prevents drift identically.
…After (aws-samples#79 review aws-samples#4) PR aws-samples#79 review aws-samples#4 surfaced a sibling-channel-failure hazard: when GitHub or Email rate-limits, the record lands in ``batchItemFailures``. On the Lambda retry, every Slack-subscribed event for that record runs again. Terminal events were already guarded by ``slack_notified_terminal``; ``agent_error`` was not — operators would page twice on a single agent failure if a sibling channel happened to fail.

Generalize the dedup mechanism. ``TERMINAL_EVENTS`` is replaced by a ``SLACK_DEDUP_ATTRIBUTE`` map that marks each event type with the ``channel_metadata`` attribute that should guard the post:
- 5 terminals share ``slack_notified_terminal`` (any first-arriving terminal claims the right; subsequent terminals dedup against it)
- ``agent_error`` gets its own ``slack_dispatched_agent_error`` so a duplicate agent_error doesn't reuse the terminal slot
- ``task_created`` / ``session_started`` map to ``null`` because they already use the per-event ``slack_*_msg_ts`` conditional persists from review #1 — the conditional already provides full idempotency (a separate marker would be redundant)

Also surfaces Slack's ``Retry-After`` header on rate-limited responses through a dedicated ``fanout.slack.retryable_api_error`` warn so operators reading CloudWatch can see the recovery window instead of guessing from sustained warn rate.

Tests:
- logs Retry-After header on rate-limited Slack responses (new): asserts ``retry_after_seconds`` propagates from Slack's response header into the warn metadata
- existing terminal-codes parametrized test untouched (terminal branch doesn't read headers)
- existing retryable test gains a ``headers: { get: () => null }`` stub on the fetch mock so the headers.get call doesn't crash

Reviewer suggested a per-channel dispatch bitmap as the alternative.
Rejected as premature: the duplicate-GitHub-PATCH is harmless (idempotent), Email is still a stub, and the dedup map covers the specific agent_error pain identified above. A bitmap would add a new table + IAM grants + per-dispatch DDB cost for a hypothetical problem (Slack rate-limiting AND a sibling channel failure).
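The dedup map described above might look like the following. The five terminal event names are inferred from the commit bodies (the four emoji terminals plus task_stranded, which later text confirms shares the terminal slot) — treat the exact membership as an assumption.

```typescript
// Hypothetical sketch of the SLACK_DEDUP_ATTRIBUTE map. A null value
// means the event relies on its per-event slack_*_msg_ts conditional
// persist instead of a dedicated marker attribute.
const SLACK_DEDUP_ATTRIBUTE: Readonly<Record<string, string | null>> = {
  // Five terminals share one slot: the first arrival claims it and
  // subsequent terminals dedup against it.
  task_completed: 'slack_notified_terminal',
  task_failed: 'slack_notified_terminal',
  task_cancelled: 'slack_notified_terminal',
  task_timed_out: 'slack_notified_terminal',
  task_stranded: 'slack_notified_terminal',
  // agent_error gets its own slot so it can't shadow a later terminal.
  agent_error: 'slack_dispatched_agent_error',
  // Lifecycle events are already idempotent via conditional persists.
  task_created: null,
  session_started: null,
};
```

The two distinct slots are what the later test-gap commit pins: setting `slack_dispatched_agent_error` must not suppress a subsequent terminal post.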
…samples#79 test gap) Adds 4 tests covering the lifecycle-persist conditional path that review fix #1 introduced and review fix aws-samples#6 hardened. Pre-PR-aws-samples#79 the only ConditionalCheckFailed coverage was the terminal-dedup path; the new lifecycle-persist + dup-delete code lacked direct assertions and was flagged 9/10 criticality by the reviewer.

- task_created persist ConditionalCheckFailed → posts duplicate then deletes it: pins the cleanup behaviour that prevents ghost task_created posts in the channel
- session_started persist ConditionalCheckFailed → posts duplicate then deletes it: parallel coverage for the other lifecycle attribute (slack_session_msg_ts)
- dup-delete failure emits fanout.slack.dup_delete_failed with error_id: pins the operator-alarm signal added in review fix aws-samples#6; asserts both the event key and the FANOUT_SLACK_DUP_DELETE_FAILED error_id propagate
- chat.delete returning message_not_found is treated as success (no dup_delete_failed): negative-class assertion. Prevents false-positive alarms when the race resolves cleanly (the duplicate was already deleted by a prior retry).

The ghost / message_not_found tests use ``fetchMock.mockImplementation`` URL-routing rather than ``.mockResolvedValueOnce`` chains because ``updateReaction`` issues 2-3 reaction-API fetches between chat.postMessage and chat.delete; routing by URL keeps the test focused on the load-bearing chat.delete behaviour without coupling to reaction call order.
…s#79 test gap aws-samples#32) Pre-PR-aws-samples#79 the new ``taskStrandedMessage`` and ``agentErrorMessage`` helpers in slack-blocks.ts had no direct unit tests. Reviewer flagged this as a 7/10 gap because the renderers carry the prior_status / error_type / message_preview metadata threaded through from the event source — silent drift in the metadata field names would produce ugly fallback messages in production.

Adds 5 tests:
- task_stranded WITH metadata renders the prior_status parenthetical (``Task stranded for org/repo (last status: RUNNING)``) so operators can tell at a glance whether the task hung in HYDRATING vs RUNNING — without the parenthetical the reviewer's "generic Event: ..." UX regression would resurface.
- task_stranded WITHOUT metadata still renders cleanly (legacy events written before the reconciler started stamping metadata must not crash or leak ``undefined``).
- agent_error with full metadata (error_type + message_preview) renders the rotating_light, type, and preview.
- agent_error WITHOUT metadata stays sensible — no leaked ``undefined`` strings or empty ``_Type:_`` line.
- agent_error truncates a 500-char message_preview to keep Slack channel UX readable.
…ples#79 test gap aws-samples#33) Pre-PR-aws-samples#79 review-fix aws-samples#4 there was no direct test for the ``slack_dispatched_agent_error`` dedup attribute or its interaction with the existing ``slack_notified_terminal`` slot. A future refactor that collapsed the two slots — or renamed one of them — would silently break the sibling-channel-failure-retry guarantee that fix aws-samples#4 added.

Adds 4 tests:
- ``agent_error claims its own dedup attribute``: pins the UpdateExpression and ConditionExpression strings so a refactor that renames the attribute breaks loudly.
- ``agent_error retry hits the dedup guard``: end-to-end scenario matching review aws-samples#4 — task already has ``slack_dispatched_agent_error: true``, retry must short-circuit before chat.postMessage. Without the guard, a second rotating_light fires.
- ``terminal dedup attribute is per-class``: a flaky task_completed-then-task_failed sequence dedups against the same ``slack_notified_terminal`` slot. Catches the regression where the orchestrator emits both terminal types and we'd otherwise post both ✅ and ❌.
- ``agent_error and terminals use distinct dedup slots``: the important negative — having ``slack_dispatched_agent_error`` set must NOT shadow a subsequent ``task_completed``. Pins the slot separation so a future merge into a single slot can't silently drop terminals after an agent_error.
…es#79 test gap aws-samples#34) The construct shipped on issue aws-samples#64 with no unit-level coverage of its IAM contract. The only synth-level signal lived inside ``slack-integration.test.ts`` ("0 EventSourceMapping") which proved the migration didn't regress the OTHER construct. Reviewer flagged this 6/10 — and the gap is what allowed review aws-samples#2 (unconditional Slack secret grant) to slip through in the first place.

Adds 6 tests:
- ``attaches a single DynamoEventSource on the TaskEventsTable stream``: pins the architectural invariant — issue aws-samples#64 was fundamentally about reaching exactly-one stream reader. Adding a second consumer must fail this test loudly.
- ``creates a DLQ for the fanout Lambda``: pins retention period + presence; a DLQ-less deployment would silently drop poison-pill records past retryAttempts.
- ``omits the bgagent/slack/* grant when slackSecretArnPattern is not provided``: the review aws-samples#2 invariant. Iterates every IAM::Policy and asserts NONE of them grant secretsmanager:* on a bgagent/slack/* ARN. A regression that re-introduces the unconditional grant breaks this test.
- ``attaches the bgagent/slack/* grant only when slackSecretArnPattern is provided``: the positive case. Pins the grant shape (action, effect, resource pattern).
- ``passes TASK_TABLE_NAME env var when taskTable is provided``: review aws-samples#3 dependency — the dispatcher throws on missing env.
- ``omits TASK_TABLE_NAME env var when taskTable is not provided``: graceful degrade for dev stacks that haven't onboarded the TaskTable yet (matches the construct's documented contract).
test gap aws-samples#35) The reconciler at handlers/reconcile-stranded-tasks.ts:170 emits BOTH ``task_stranded`` and ``task_failed`` for a heartbeat-expired task — one for the operator signal, one to drive the FAILED status transition. Pre-PR-aws-samples#79 this pair had no test coverage; reviewer flagged this 8/10 because the visible failure mode (a paired "Task stranded" + "Task failed" double-page in Slack) would surface in production but be silent in CI.

Adds 2 tests:
- ``task_stranded posts and writes the terminal dedup marker on first arrival``: pins that task_stranded participates in the shared terminal slot and renders the warning message with metadata. Catches a regression that omits task_stranded from the dedup map entirely.
- ``task_stranded after a sibling task_failed dedups``: the operational scenario — task_failed already claimed ``slack_notified_terminal``; the subsequent task_stranded must short-circuit before chat.postMessage. Without this guard, operators get the double-page the reviewer warned about.
…an-message race Live observation during PR aws-samples#79 review verification: the same Slack @mention happy path sometimes leaves the 🚀 task_created message in the thread (orphaned beside the ✅ task_completed) and sometimes deletes it cleanly.

The race window:
1. ``task_created`` stream batch posts the rocket message and persists ``slack_created_msg_ts`` via the conditional UpdateItem introduced in PR aws-samples#79 review fix #1.
2. ``task_completed`` stream batch fires ~30s later. Its initial GetItem races the prior UpdateItem and sees a stale ``channel_metadata`` WITHOUT ``slack_created_msg_ts``.
3. The terminal cleanup branch checks ``channelMeta.slack_created_msg_ts`` — undefined — silently skips the chat.delete. The rocket message stays in the thread.

Add a fresh GetItem inside the TERMINAL_EVENTS cleanup branch, after the dedup UpdateItem has linearized our view of the table. Any prior ``slack_*_msg_ts`` writes are visible by then, so the cleanup fires correctly. On a re-read failure (DDB throttle / transient blip) we fall back to the dispatch-entry snapshot and emit ``fanout.slack.cleanup_reread_failed`` so operators can alarm on the rate.

This is a pre-existing race (the unconditional UpdateItem in pre-PR-aws-samples#79 was the same shape — wrote, GetItem on the next batch could miss it). PR aws-samples#79 doesn't introduce it but doesn't fix it either; this commit does, since the live screenshot evidence appeared during review verification.

Tests:
- ``terminal cleanup re-reads TaskRecord``: scripts a stale dispatch-entry GetItem followed by a fresh re-read GetItem with ``slack_created_msg_ts`` present; asserts chat.delete fires against the freshly-read ts.
- ``terminal cleanup falls back to dispatch-entry snapshot when re-read fails``: defense-in-depth — DDB throttle on the re-read must not break terminal delivery; cleanup uses the entry snapshot and emits the fallback warn.
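The re-read-with-fallback can be sketched as a small helper. Everything here is illustrative — the read and the warn logger are injected, and the real code operates on the full TaskRecord inside the terminal-cleanup branch rather than a bare metadata map.

```typescript
// Illustrative sketch of the terminal-cleanup re-read. freshRead stands
// in for the second GetItem; entrySnapshot is the channel_metadata view
// captured at dispatch entry.
type ChannelMeta = Record<string, string | undefined>;

async function resolveCleanupMeta(
  entrySnapshot: ChannelMeta,
  freshRead: () => Promise<ChannelMeta>,
  warn: (eventKey: string) => void,
): Promise<ChannelMeta> {
  try {
    // After the dedup UpdateItem, a fresh read observes any prior
    // slack_*_msg_ts writes, closing the stale-GetItem window.
    return await freshRead();
  } catch {
    // DDB throttle / transient blip: terminal delivery must still
    // proceed, so fall back and surface the rate for alarming.
    warn('fanout.slack.cleanup_reread_failed');
    return entrySnapshot;
  }
}
```

On the happy path the cleanup now deletes against the freshly-read ts; on the fallback path behaviour degrades to the pre-fix shape, with an alarmable signal.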
Addressed all 10 findings + 5 test gaps. Each in its own commit citing the review item — 16 commits total on top of the original migration.

Critical / High fixes (numbers refer to the review-comment ordering, not GitHub issues):
Test gaps:
Plus one extra commit: the terminal-cleanup re-read fix for the orphan-message race observed during live verification.

Verification:
Ready for re-review.
Summary
Closes #64. Moves Slack outbound delivery off its own `TaskEventsTable` DynamoDB Streams consumer onto `FanOutConsumer` as a per-channel dispatcher. After this PR, `TaskEventsTable` has exactly one stream reader (`FanOutConsumer`), restoring headroom for future channels (Email, Teams, etc.) without exceeding DynamoDB's documented 2-reader-per-shard practical limit.

The single commit message walks through the migration (a) and four review-fix layers (b–f) — please read it before the diff for context. Quick map:
- Dispatcher rewritten as `dispatchSlackEvent()`, FanOut wires it in, IAM grant migrated, SlackNotifyFn + its DynamoEventSource deleted from SlackIntegration.
- Render cases added for `task_stranded`, `agent_error`.
- Conditional UpdateItem on `task_created`/`session_started` to prevent duplicate root messages on retry.
- Dropped `pr_created` from Slack defaults (verified live; was visually duplicating with `task_completed`'s View PR button).

Test plan
- `mise //cdk:compile` clean
- `mise //cdk:test` — 1183 / 1183 pass (8 net-new tests added for the review fixes)
- `mise //cdk:eslint` clean
- `mise //cdk:synth` shows exactly one `AWS::Lambda::EventSourceMapping` on `TaskEventsTable`, pointing at `FanOutFn`

Notes
After deploying:
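The command itself didn't survive this capture. Based on the commit message earlier in the thread it was presumably the event-source-mappings listing; a hedged sketch (the stream-ARN variable is illustrative — substitute your deployed stack's value):

```shell
# List every consumer of the TaskEventsTable stream; after this PR the
# output should contain only the FanOutFn ARN.
aws lambda list-event-source-mappings \
  --event-source-arn "$TASK_EVENTS_TABLE_STREAM_ARN" \
  --query 'EventSourceMappings[].FunctionArn'
```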
Should return exactly one ARN: the FanOutFn.