
fix(webhooks): Eliminate head-of-line blocking in sequential mailbox drain #110215

Merged
tnt-sentry merged 3 commits into master from fix/drain-mailbox-head-of-line-blocking
Mar 10, 2026

Conversation

@tnt-sentry
Contributor

Previously, drain_mailbox returned immediately when any single webhook delivery failed, blocking all subsequent messages in that mailbox until the next scheduled retry cycle (up to 3 minutes away). A single transient 500 or timeout could delay dozens of unrelated, healthy webhooks.

This changes drain_mailbox to skip failed messages and continue delivering the remaining messages in the mailbox — matching the behavior already present in the parallel drain path (drain_mailbox_parallel). Failed messages are individually rescheduled via the existing schedule_next_attempt() mechanism, so they will be retried on a future cycle.

A current_id variable now tracks the highest ID processed so far, ensuring that failed records (which stay in the database with a future schedule_for) are not re-fetched within the same drain invocation.

The ordering semantics change from strict to best-effort, which is safe because:

  • The parallel drain already uses best-effort ordering without issues
  • GitHub webhook types (push, PR, issue events) are idempotent
  • Region silos handle duplicate/out-of-order events
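The description above can be condensed into a small sketch. This is not the real Sentry code: `drain_mailbox` here takes plain dict records and a `deliver` callable as stand-ins for the actual ORM query and `deliver_message` helper, but it shows the skip-on-failure loop and the `current_id` bookkeeping the PR introduces.

```python
def drain_mailbox(records: list, deliver) -> dict:
    """Attempt every record; skip failures instead of returning early.

    Sketch of the revised drain loop. `deliver` is assumed to reschedule
    a record itself on failure (schedule_next_attempt in the real code).
    """
    delivered = 0
    failed = 0
    current_id = 0  # highest ID processed, so failed rows aren't re-fetched
    for record in records:
        # Advance past this record regardless of outcome, so a failed
        # record (still in the DB with a future schedule_for) is not
        # re-fetched within the same drain invocation.
        current_id = max(current_id, record["id"] + 1)
        try:
            deliver(record)
            delivered += 1
        except Exception:
            # Previously this was a `return`; now we keep draining.
            failed += 1
            continue
    return {"delivered": delivered, "failed": failed, "current_id": current_id}
```

With the old `return`-on-failure behavior, a failure on the second of three records would leave the third undelivered; with `continue`, only the failing record waits for its retry cycle.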

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Mar 9, 2026
@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Backend Test Failures

Failures on 1535f98 in this run:

tests/sentry/hybridcloud/tasks/test_deliver_webhooks.py::ScheduleWebhooksTest::test_schedule_mailbox_with_more_than_batch_size_records
tests/sentry/hybridcloud/tasks/test_deliver_webhooks.py:124: in test_schedule_mailbox_with_more_than_batch_size_records
    assert len(responses.calls) == 1
E   assert 55 == 1
E    +  where 55 = len(<responses.CallList object at 0x7ff049047a10>)
E    +    where <responses.CallList object at 0x7ff049047a10> = responses.calls

@tnt-sentry tnt-sentry marked this pull request as ready for review March 9, 2026 20:04
@tnt-sentry tnt-sentry requested a review from a team as a code owner March 9, 2026 20:04
Comment on lines +327 to +329
# Continue processing remaining messages instead of stopping.
# Failed messages have already been rescheduled by deliver_message.
continue
Member

While GitHub webhooks can generally be handled out of order, that isn't true of integrations like Jira, where ordering matters more because we can't make the changes idempotent. Those integrations are generally lower volume and don't qualify for parallel delivery today, but with this change they could start seeing webhooks handled out of order.

Should we also track the integration providers that are hitting this path to validate that only idempotent integrations show up here? Or perhaps we have different behavior for different integrations?
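The per-integration behavior the reviewer asks about is roughly what the later commits add via an option-backed allowlist. A minimal sketch, assuming mailbox names take the `"<provider>:<id>"` form and using a module constant where the real code would call the options API:

```python
# Stand-in for options.get("hybridcloud.webhookpayload.skip_on_failure_providers")
SKIP_ON_FAILURE_PROVIDERS = ["github"]

def should_skip_on_failure(mailbox_name: str) -> bool:
    """Only allowlisted providers get best-effort ordering.

    Assumes mailbox names look like '<provider>:<id>'; everything else
    keeps strict stop-on-first-failure draining.
    """
    provider = mailbox_name.split(":", 1)[0]
    return provider in SKIP_ON_FAILURE_PROVIDERS
```

The drain loop would then consult this before deciding whether a failure triggers `continue` (skip) or `return` (strict ordering).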

Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Missing flag prevents clearing the provider allowlist at runtime
    • Added FLAG_ALLOW_EMPTY to hybridcloud.webhookpayload.skip_on_failure_providers so the system options API accepts an empty list and operators can clear the allowlist at runtime.

Preview (03e0e902ae)
diff --git a/src/sentry/options/defaults.py b/src/sentry/options/defaults.py
--- a/src/sentry/options/defaults.py
+++ b/src/sentry/options/defaults.py
@@ -2505,7 +2505,7 @@
     "hybridcloud.webhookpayload.skip_on_failure_providers",
     type=Sequence,
     default=["github"],
-    flags=FLAG_AUTOMATOR_MODIFIABLE,
+    flags=FLAG_ALLOW_EMPTY | FLAG_AUTOMATOR_MODIFIABLE,
 )
 
 # Break glass controls
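The diff above matters because option flags are combined as a bitmask, and without `FLAG_ALLOW_EMPTY` the options API rejects an empty value, so operators could never clear the allowlist. A sketch of that mechanism under assumed flag values and an illustrative `validate` helper (the real registration machinery lives in Sentry's options module):

```python
# Illustrative bit values; only the bitmask-combination idea is the point.
FLAG_AUTOMATOR_MODIFIABLE = 1 << 0
FLAG_ALLOW_EMPTY = 1 << 1

def validate(value, flags: int):
    """Reject empty values unless the option opted in via FLAG_ALLOW_EMPTY."""
    if value in ([], "", None) and not (flags & FLAG_ALLOW_EMPTY):
        raise ValueError("empty value not permitted for this option")
    return value
```

With only `FLAG_AUTOMATOR_MODIFIABLE`, setting the option to `[]` raises; OR-ing in `FLAG_ALLOW_EMPTY` makes the empty list a legal runtime value.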

…drain

Previously, drain_mailbox stopped processing the entire mailbox when any
single message failed, blocking all subsequent webhooks until the next
scheduled retry cycle (up to 3 minutes). This caused the p50 delivery
latency to be dominated by transient failures in unrelated messages.

Now drain_mailbox skips failed messages and continues delivering the
remaining messages in the mailbox, matching the behavior already present
in the parallel drain path. Failed messages are individually rescheduled
by the existing schedule_next_attempt() mechanism.

The current_id variable tracks progress through the mailbox so that
failed records are not re-queried within the same drain invocation.
…havior

The test expected drain_mailbox to stop after the first timeout failure.
With the head-of-line blocking fix, all messages are now attempted, so
update the response call count assertion to match the new behavior.
The skip-failed-messages behavior is only safe for providers that handle
out-of-order delivery gracefully. Add a runtime-configurable option
(hybridcloud.webhookpayload.skip_on_failure_providers) to control which
providers opt in. github is enabled by default; all other providers
retain strict stop-on-first-failure ordering.
@tnt-sentry tnt-sentry force-pushed the fix/drain-mailbox-head-of-line-blocking branch from 0a0a6fc to 415df3d Compare March 10, 2026 15:49
@tnt-sentry tnt-sentry enabled auto-merge (squash) March 10, 2026 15:49
Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

# Fetch records from the batch in slices of 100. This avoids reading
# redundant data should we hit an error and should help keep query duration low.
query = WebhookPayloadReplica.objects.filter(
id__gte=payload.id, mailbox_name=payload.mailbox_name
Contributor


Deadline log omits new failed counter

Low Severity

The deliver_webhook.delivery_deadline log includes delivered but not the new failed counter, while the sibling delivery_complete_with_failures log does include it. Previously this didn't matter because any failure caused an immediate return, so the deadline path could never have failed > 0. With the new skip-on-failure behavior for allowlisted providers, the drain loop can now accumulate failures and hit the deadline, making this an observability gap for operators trying to understand drain behavior.
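The fix Bugbot suggests is small: emit the same fields on both log paths. A sketch assuming the field names from the comment (`delivered`, `failed`, `mailbox_name`); the helper is illustrative, not Sentry's actual code:

```python
import logging

logger = logging.getLogger("sentry.deliver_webhooks")

def deadline_log_fields(mailbox_name: str, delivered: int, failed: int) -> dict:
    # Mirror the fields of delivery_complete_with_failures so the
    # deadline exit path reports failures too.
    return {"mailbox_name": mailbox_name, "delivered": delivered, "failed": failed}

def log_delivery_deadline(mailbox_name: str, delivered: int, failed: int) -> None:
    logger.info(
        "deliver_webhook.delivery_deadline",
        extra=deadline_log_fields(mailbox_name, delivered, failed),
    )
```

Keeping both log events structurally identical lets operators diff the two paths directly when investigating drain behavior.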

Additional Locations (1)

Comment on lines 313 to 323
).order_by("id")

batch_count = 0
for record in query[:100]:
batch_count += 1
# Advance past this record regardless of outcome so that failed
# messages are not re-attempted in subsequent batches of this drain.
current_id = record.id + 1
# Refresh the lock on each delivery so a slow HTTP response in the
# inner loop (up to 30s timeout × 100 records) cannot outlast the
# 15s TTL and let the key expire mid-batch.
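The comment in the snippet above describes a classic TTL-versus-work-duration hazard: a 15s lock TTL can expire during a single 30s delivery unless it is refreshed per iteration. A minimal in-process sketch of the idea (the real code would refresh a distributed lock, e.g. in Redis; this class is purely illustrative):

```python
import time

class TTLLock:
    """Toy lock whose hold expires after `ttl` seconds unless refreshed."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.expires_at = time.monotonic() + ttl

    def refresh(self) -> None:
        # Called once per delivery, before the slow HTTP request,
        # so the hold always outlasts the next unit of work.
        self.expires_at = time.monotonic() + self.ttl

    def held(self) -> bool:
        return time.monotonic() < self.expires_at
```

Refreshing before each record, rather than once per batch, bounds the required TTL by one delivery's worst case instead of the whole batch's.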
Contributor


Bug: When a webhook at the head of a mailbox fails and enters backoff, schedule_webhook_delivery and maybe_trigger_drain will not process the mailbox, blocking all subsequent ready messages.
Severity: HIGH

Suggested Fix

Modify schedule_webhook_delivery and maybe_trigger_drain to handle mailboxes where the head is in backoff for skip_on_failure providers. Instead of only checking the head message, the logic should check for any ready message in the mailbox. This would allow the drain process to be triggered, which can then skip the backed-off head message and process the subsequent ready ones.
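The suggested fix can be sketched as a readiness predicate: for strict-ordering providers only the head message matters, while skip-on-failure providers should drain whenever any message is ready. Record shape and function name are assumptions for illustration, not the real scheduler API:

```python
from datetime import datetime, timedelta, timezone

def mailbox_has_ready_message(records: list, now: datetime, skip_on_failure: bool) -> bool:
    """Decide whether a mailbox should be drained.

    `records` are the mailbox's messages in ID order, each with a
    `schedule_for` datetime.
    """
    if not records:
        return False
    if not skip_on_failure:
        # Strict ordering: a backed-off head legitimately blocks the rest.
        return records[0]["schedule_for"] <= now
    # Skip-on-failure providers: a backed-off head should not block
    # ready messages behind it between drain invocations.
    return any(r["schedule_for"] <= now for r in records)
```

This closes the gap the reviewer describes: the drain gets triggered, and the drain loop itself then skips the backed-off head.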

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/sentry/hybridcloud/tasks/deliver_webhooks.py#L309-L323

Potential issue: For providers with `skip_on_failure=True`, if the message at the head
of a mailbox fails, it is rescheduled for a future attempt. However, both the scheduler
(`schedule_webhook_delivery`) and the push-trigger (`maybe_trigger_drain`) only check if
the head message is ready (`schedule_for <= now()`) before processing a mailbox. If the
head message is in a backoff period, the entire mailbox is skipped. This introduces a
new form of head-of-line blocking between drain invocations, causing all subsequent,
ready-to-deliver messages in that mailbox to be stuck until the head's backoff period
expires, which can be up to 60 minutes.


@tnt-sentry tnt-sentry merged commit 65488d2 into master Mar 10, 2026
54 of 55 checks passed
@tnt-sentry tnt-sentry deleted the fix/drain-mailbox-head-of-line-blocking branch March 10, 2026 16:14
