feat(slack): Create fallback incident channel on DB failure #176

Merged
rgibert merged 22 commits into main from rgibert/db-fallback-channel
May 8, 2026

Conversation

rgibert (Member) commented May 7, 2026

When serializer.save() fails during incident creation from the Slack modal (DB unreachable, or any other error), we now create a fallback Slack channel named inc-<uuid[:8]> so teams can coordinate immediately instead of being told to create one manually.
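As an illustration of the `inc-<uuid[:8]>` naming, here is a minimal sketch. The function name `fallback_channel_name` is hypothetical and not taken from the PR; the sketch assumes the name is the `inc-` prefix followed by the first eight hex characters of a random UUID.

```python
import uuid


def fallback_channel_name(prefix: str = "inc-") -> str:
    """Return a name such as inc-1a2b3c4d (hypothetical helper).

    The first 8 characters of a UUID4 are lowercase hex digits, so the
    result contains only characters Slack allows in channel names.
    """
    return f"{prefix}{uuid.uuid4().hex[:8]}"


name = fallback_channel_name()
```

Eight hex characters give about 4.3 billion possible names, so collisions between fallback channels are unlikely but not impossible.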

The fallback replicates as much of the normal on_incident_created flow as possible without DB access:

  • Creates the incident channel with a random hash name
  • Posts and pins a structured metadata message with all modal form fields
    (title, severity, description, tags, captain, etc.) for later backfill
  • Posts a degraded-mode warning explaining the channel needs backfill
  • Posts the incident guide message
  • Invites captain, reporter, and always-invited users
  • Creates a Datadog notebook (non-private only)
  • Creates a Notion troubleshooting doc (non-private only)
  • Creates a status channel for P0/P1 (non-private only)
  • Invites oncall users for P0/P1
  • Pages PagerDuty for P0/P1 with the Slack channel link
  • Posts to the incident feed channel (non-private only)
  • DMs the submitting user with the channel link

The DB-dependent parts are skipped: no Incident row, no ExternalLink dedup, no Firetower URL. Backfill of orphaned inc-* channels will be handled separately.

Also adds post_message_return_ts and pin_message methods to SlackService to support pinning the metadata message.
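A rough sketch of how the post-then-pin step could look. The real methods live on SlackService and their bodies are not shown in this PR description, so the module-level functions below are an assumption. They expect a client exposing `chat_postMessage` and `pins_add`, as slack_sdk's `WebClient` does; `StubClient` is a hypothetical stand-in so the sketch runs without Slack credentials.

```python
from typing import Any, Optional


def post_message_return_ts(client: Any, channel_id: str, text: str) -> Optional[str]:
    """Post a message and return its timestamp, or None if the call fails."""
    try:
        response = client.chat_postMessage(channel=channel_id, text=text)
    except Exception:
        return None
    return response.get("ts")


def pin_message(client: Any, channel_id: str, ts: str) -> bool:
    """Pin the message identified by ts; return True on success."""
    try:
        client.pins_add(channel=channel_id, timestamp=ts)
    except Exception:
        return False
    return True


class StubClient:
    """Hypothetical stand-in for a Slack client, used only to run the sketch."""

    def __init__(self) -> None:
        self.pinned = []

    def chat_postMessage(self, channel: str, text: str) -> dict:
        return {"ok": True, "ts": "1700000000.000100"}

    def pins_add(self, channel: str, timestamp: str) -> dict:
        self.pinned.append((channel, timestamp))
        return {"ok": True}


client = StubClient()
ts = post_message_return_ts(client, "C123", "incident metadata")
if ts is not None:
    pin_message(client, "C123", ts)
```

Slack identifies a message by its channel and timestamp, so the post call has to hand back the timestamp before the message can be pinned.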

Agent transcript: https://claudescope.sentry.dev/share/SZmgtrAzIjJHMrC-H8EqX1AhpwsgNQYghS8s-795rR0

From testing: [image]

rgibert added 5 commits May 7, 2026 10:51
Add post_message_return_ts as a variant of post_message that returns
the message timestamp instead of a boolean, enabling callers to
reference the message for threading or pinning.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/zmqQ1m0NJE_-FuqvcuXZ18ZttUQr0XtFVr02YLG2tyw
Add pin_message to pin a message in a Slack channel by timestamp,
returning a bool to indicate success. This will be used alongside
post_message_return_ts in the fallback channel creation flow.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/8XVdAzyoTtZahrYYEo_XJD2cZgNAOzObwuOpS5oVA4Y
Comment thread src/firetower/slack_app/handlers/new_incident.py
Comment thread src/firetower/slack_app/handlers/new_incident.py Outdated
rgibert added 3 commits May 7, 2026 11:18
post_message now returns the message timestamp (str | None) instead of
bool, making post_message_return_ts redundant. All callers that discarded
the return value are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/HTzewExf7sMcwUKuYVYdLDdeJTjwtlCMFemciQiB6ho
Both the normal DB-backed and fallback (DB-unreachable) incident creation
paths now call the same decorate_incident_channel() function for channel
setup steps (guide message, DD notebook, Notion doc, IC mention,
description, user invites, oncall invites, status channel, feed post).
This eliminates ~250 lines of duplicated logic in _create_fallback_channel
and prevents drift between the two paths.

Primitives-only helpers (page_for_channel, _invite_oncall_to_channel,
_create_status_channel_for_context) accept a SlackService via dependency
injection so callers can pass their own instance, keeping existing test
mocks working without dual-mock complexity.
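To illustrate the dependency-injection idea, here is a small sketch. The `SlackLike` protocol, the `invite_to_channel` method name, and the function signature are all assumptions made for illustration; the actual SlackService interface is not shown in this commit message.

```python
from typing import List, Protocol, Sequence, Tuple


class SlackLike(Protocol):
    """Hypothetical minimal interface the helper needs from a Slack service."""

    def invite_to_channel(self, channel_id: str, user_ids: Sequence[str]) -> bool: ...


class RecordingSlack:
    """Test double that records invites instead of calling Slack."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Tuple[str, ...]]] = []

    def invite_to_channel(self, channel_id: str, user_ids: Sequence[str]) -> bool:
        self.calls.append((channel_id, tuple(user_ids)))
        return True


def invite_oncall_to_channel(
    channel_id: str, oncall_user_ids: Sequence[str], slack: SlackLike
) -> bool:
    """Invite oncall users using whichever service instance the caller passes in."""
    if not oncall_user_ids:
        return False
    return slack.invite_to_channel(channel_id, oncall_user_ids)


fake = RecordingSlack()
invited = invite_oncall_to_channel("C1", ["U1", "U2"], fake)
```

Because the helper never constructs its own service, a test only has to mock the one instance it passes in.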

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/Pi2pox5vOj8iMXFMdUt1noXDBVhcFwiMUZY98SQGc3E
Comment thread src/firetower/slack_app/handlers/new_incident.py
…efactor

The decorate_incident_channel extraction (e184792) moved oncall invite
and status channel creation into the shared orchestrator, which calls
_invite_oncall_to_channel and _create_status_channel_for_context
directly. Tests for on_incident_created were still patching the
higher-level wrappers (_invite_oncall_users, _create_status_channel)
that are only used by on_severity_changed, so the mocks had no effect.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/Is4vKBCpx9IMlodDre7fniFgvTC_49tk6w6Djl51QW8
Comment thread src/firetower/integrations/services/slack.py
…hannel

$ Conflicts:
$	src/firetower/incidents/hooks.py
$	src/firetower/slack_app/handlers/new_incident.py

Agent transcript: https://claudescope.sentry.dev/share/wOfmVrIupzZyuS1QphAL0rmTC5f-Kavnrplig2a0QnA
@rgibert rgibert marked this pull request as ready for review May 8, 2026 13:48
@rgibert rgibert requested a review from a team as a code owner May 8, 2026 13:48
Comment thread src/firetower/incidents/hooks.py
Comment thread src/firetower/slack_app/handlers/new_incident.py
Move Datadog/Notion creation after decorate_incident_channel in the
normal path so the guide message appears first in the channel.

Narrow the fallback channel trigger to OperationalError only, so
non-DB failures (e.g. hook errors after incident is already persisted)
send an error DM instead of creating a misleading fallback channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/SyhTFocRgYXsz1EyKb08o_I-sJXfpjypi8yKYXWL1ZM
Comment thread src/firetower/slack_app/handlers/new_incident.py
Comment thread src/firetower/slack_app/handlers/new_incident.py
rgibert and others added 4 commits May 8, 2026 10:27
The fallback channel logic only protected serializer.save(), but
get_or_create_user_from_slack_id and serializer.is_valid() also
hit the DB. If the DB was down, those calls would raise
OperationalError before the try block, so the fallback never fired.

Extract _create_incident_via_db to group all DB work under one
OperationalError handler so the fallback triggers on any DB failure.
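The grouping described above can be sketched as follows. This is an illustration, not the PR's code: `OperationalError` here is a local stand-in for `django.db.OperationalError` so the example runs without Django, and the callables passed in are hypothetical placeholders for the user lookup, validation, save, fallback, and error-DM steps.

```python
from typing import Any, Callable


class OperationalError(Exception):
    """Stand-in for django.db.OperationalError (assumption for this sketch)."""


def create_incident_via_db(
    get_user: Callable[[], Any],
    validate: Callable[[Any], Any],
    save: Callable[[Any], Any],
) -> Any:
    """Run every DB-touching step in one place so one handler covers them all."""
    user = get_user()
    data = validate(user)
    return save(data)


def handle_submission(
    get_user: Callable[[], Any],
    validate: Callable[[Any], Any],
    save: Callable[[Any], Any],
    create_fallback_channel: Callable[[], Any],
    send_error_dm: Callable[[], Any],
) -> Any:
    try:
        return create_incident_via_db(get_user, validate, save)
    except OperationalError:
        # DB failure at any step: fall back to a DB-less channel.
        return create_fallback_channel()
    except Exception:
        # Non-DB failure: report the error instead of creating a fallback.
        return send_error_dm()


def db_down() -> Any:
    raise OperationalError("connection refused")


result = handle_submission(
    db_down, lambda u: u, lambda d: d, lambda: "fallback", lambda: "error-dm"
)
```

With this shape, a failure in the first DB call is handled the same way as a failure in the last one.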

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/gKzMkOx7EVyxHv3oPk4TlTJo9JI-lLl3kZL6HFmKgiQ
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rgibert rgibert self-assigned this May 8, 2026
rgibert added 3 commits May 8, 2026 11:55
Cover creating a test Slack app from the manifest, collecting bot and
app-level tokens, configuring the feed channel, and running the bot
in Socket Mode.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/sQNA-c-iG1pDnBUKoQ_oYDQkecjc2-cKqbCGDd2SQf4
Comment thread src/firetower/incidents/hooks.py Outdated
Comment thread src/firetower/incidents/hooks.py
… DD/Notion creation

The Notion API rejects empty strings for url properties with a 400
error. In the DB-outage fallback path, incident_url was coerced from
None to "" causing silent Notion page creation failures.
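One way to avoid sending an empty string is to leave the url property out of the payload when no URL is available. The sketch below assumes that approach; the function name and the property names `Name` and `Incident URL` are hypothetical, and the commit may handle the missing URL differently.

```python
from typing import Any, Dict, Optional


def build_doc_properties(title: str, incident_url: Optional[str]) -> Dict[str, Any]:
    """Build Notion page properties, omitting the url property when unset."""
    properties: Dict[str, Any] = {
        "Name": {"title": [{"text": {"content": title}}]},
    }
    # Skip both None and "": an empty string would be rejected by the API.
    if incident_url:
        properties["Incident URL"] = {"url": incident_url}
    return properties


fallback_props = build_doc_properties("DB outage", None)
normal_props = build_doc_properties("DB outage", "https://example.invalid/INC-1")
```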

Also extracts shared Datadog notebook and Notion troubleshooting doc
creation logic into _do_create_datadog_notebook and
_do_create_troubleshooting_doc helpers, eliminating duplication between
the fallback path in decorate_incident_channel and the DB-dedup
wrappers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/Y16D4_HD176UY4rSz1vnRvGzmp2bZSvXEwBTc-Bt2-M
Comment thread src/firetower/incidents/hooks.py
Comment thread src/firetower/incidents/hooks.py Outdated
…creation

Split _do_create_datadog_notebook into API-only creation and a separate
_notify_datadog_notebook for Slack bookmark/message posting. The DB-dedup
path calls the notification after the transaction commits, restoring the
original design that avoids holding the SELECT FOR UPDATE row lock during
Slack API round-trips.
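The ordering this split restores can be sketched with a stand-in for the locked transaction. In the real code the lock would presumably come from Django's `transaction.atomic()` with `select_for_update()`; the `row_lock` context manager, the function names, and the placeholder URL below are assumptions used only to show the sequence of events.

```python
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def row_lock(events: List[str]) -> Iterator[None]:
    """Stand-in for a transaction holding a SELECT FOR UPDATE row lock."""
    events.append("lock acquired")
    try:
        yield
    finally:
        events.append("lock released")


def create_notebook(events: List[str]) -> str:
    """API-only creation step; no Slack calls happen here."""
    events.append("notebook created")
    return "https://example.invalid/notebook/1"


def notify_notebook(events: List[str], url: str) -> None:
    """Slack bookmark/message step, kept separate from creation."""
    events.append("slack notified")


def ensure_notebook(events: List[str]) -> str:
    with row_lock(events):
        url = create_notebook(events)
        events.append("link saved")
    # The lock is released before any Slack round-trip starts.
    notify_notebook(events, url)
    return url


events: List[str] = []
notebook_url = ensure_notebook(events)
```

Keeping the notification outside the `with` block means a slow Slack call cannot extend how long the row stays locked.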

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/TtWM4yO2EjgDdTzfeXgVUvkZ6z_al9mKUnCUpoVaT0o
Comment thread src/firetower/incidents/hooks.py Outdated
…ooting doc

Same regression as the Datadog notebook fix: _do_create_troubleshooting_doc
bundled Notion API calls with Slack notifications, causing bookmark and
message to fire before the ExternalLink URL was committed. If the
subsequent transaction failed, Slack would reference a doc that Firetower
had no record of.

Split into _do_create_troubleshooting_doc (API only) and
_notify_troubleshooting_doc (Slack bookmark + message), matching the
Datadog notebook pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/SXYWlDsb9OYt_zu_5j71IZ9Yab_Uz9g-Kn_myaoGRf8
Comment thread src/firetower/slack_app/handlers/new_incident.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit fb58bf1.

Comment thread src/firetower/slack_app/handlers/new_incident.py Outdated
Comment thread src/firetower/slack_app/handlers/new_incident.py
PostgreSQL raises InterfaceError (not just OperationalError) when a
previously-established connection drops during failover. Catch both so
the fallback channel is created in either case.

Also escape user-provided title, description, and impact_summary in the
fallback metadata message to prevent Slack markup injection.
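A minimal sketch of the escaping step, assuming it follows Slack's documented rule of replacing the three control characters `&`, `<`, and `>` with HTML entities. The function name `escape_slack_text` is hypothetical; the PR's actual helper is not shown here.

```python
def escape_slack_text(text: str) -> str:
    """Escape Slack's control characters in user-provided text.

    The ampersand is replaced first so the entities produced for < and >
    are not escaped a second time.
    """
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


escaped = escape_slack_text("<!channel> & <https://evil.invalid|click>")
```

After escaping, input such as `<!channel>` renders as literal text instead of triggering a channel-wide mention. This does not neutralize formatting characters such as `*` or `_`.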

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/DxUhIF-6Ihcby1dMJoyuOmOlLEG-ROMhJDQKb76H6rg
@rgibert rgibert enabled auto-merge (squash) May 8, 2026 18:29
Comment thread docker-compose.yml
# TODO: These shouldn't be in frontend/ ?
env_file: frontend/.env

slack-bot:

nice

@rgibert rgibert merged commit f853417 into main May 8, 2026
24 checks passed
@rgibert rgibert deleted the rgibert/db-fallback-channel branch May 8, 2026 19:14