
fix(webhooks): Eliminate head-of-line blocking in sequential mailbox drain #110215

Merged
tnt-sentry merged 3 commits into master from fix/drain-mailbox-head-of-line-blocking
Mar 10, 2026

Conversation

@tnt-sentry
Contributor

Previously, drain_mailbox returned immediately when any single webhook delivery failed, blocking all subsequent messages in that mailbox until the next scheduled retry cycle (up to 3 minutes away). A single transient 500 or timeout could delay dozens of unrelated, healthy webhooks.

This changes drain_mailbox to skip failed messages and continue delivering the remaining messages in the mailbox — matching the behavior already present in the parallel drain path (drain_mailbox_parallel). Failed messages are individually rescheduled via the existing schedule_next_attempt() mechanism, so they will be retried on a future cycle.

A current_id variable now tracks the highest ID processed so far, ensuring that failed records (which stay in the database with a future schedule_for) are not re-fetched within the same drain invocation.

The ordering semantics change from strict to best-effort, which is safe because:

  • The parallel drain already uses best-effort ordering without issues
  • GitHub webhook types (push, PR, issue events) are idempotent
  • Region silos handle duplicate/out-of-order events
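The description above can be condensed into a small sketch. This is not the real Sentry code: `drain_mailbox` here takes plain dict records and a `deliver` callable as stand-ins for the actual ORM query and `deliver_message` helper, but it shows the skip-on-failure loop and the `current_id` bookkeeping the PR introduces.

```python
def drain_mailbox(records: list, deliver) -> dict:
    """Attempt every record; skip failures instead of returning early.

    Sketch of the revised drain loop. `deliver` is assumed to reschedule
    a record itself on failure (schedule_next_attempt in the real code).
    """
    delivered = 0
    failed = 0
    current_id = 0  # highest ID processed, so failed rows aren't re-fetched
    for record in records:
        # Advance past this record regardless of outcome, so a failed
        # record (still in the DB with a future schedule_for) is not
        # re-fetched within the same drain invocation.
        current_id = max(current_id, record["id"] + 1)
        try:
            deliver(record)
            delivered += 1
        except Exception:
            # Previously this was a `return`; now we keep draining.
            failed += 1
            continue
    return {"delivered": delivered, "failed": failed, "current_id": current_id}
```

With the old `return`-on-failure behavior, a failure on the second of three records would leave the third undelivered; with `continue`, only the failing record waits for its retry cycle.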

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Mar 9, 2026
@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Backend Test Failures

Failures on 1535f98 in this run:

tests/sentry/hybridcloud/tasks/test_deliver_webhooks.py::ScheduleWebhooksTest::test_schedule_mailbox_with_more_than_batch_size_records
tests/sentry/hybridcloud/tasks/test_deliver_webhooks.py:124: in test_schedule_mailbox_with_more_than_batch_size_records
    assert len(responses.calls) == 1
E   assert 55 == 1
E    +  where 55 = len(<responses.CallList object at 0x7ff049047a10>)
E    +    where <responses.CallList object at 0x7ff049047a10> = responses.calls

@tnt-sentry tnt-sentry marked this pull request as ready for review March 9, 2026 20:04
@tnt-sentry tnt-sentry requested a review from a team as a code owner March 9, 2026 20:04
Comment on lines +327 to +329
# Continue processing remaining messages instead of stopping.
# Failed messages have already been rescheduled by deliver_message.
continue
Member

While GitHub webhooks can generally be handled out of order, that isn't true of integrations like Jira, where ordering matters more because we can't make the changes idempotent. Those integrations are generally lower volume and don't qualify for parallel delivery today, but with this change they could start seeing webhooks handled out of order.

Should we also track the integration providers that are hitting this path to validate that only idempotent integrations show up here? Or perhaps we have different behavior for different integrations?
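The per-integration behavior the reviewer asks about is roughly what the later commits add via an option-backed allowlist. A minimal sketch, assuming mailbox names take the `"<provider>:<id>"` form and using a module constant where the real code would call the options API:

```python
# Stand-in for options.get("hybridcloud.webhookpayload.skip_on_failure_providers")
SKIP_ON_FAILURE_PROVIDERS = ["github"]

def should_skip_on_failure(mailbox_name: str) -> bool:
    """Only allowlisted providers get best-effort ordering.

    Assumes mailbox names look like '<provider>:<id>'; everything else
    keeps strict stop-on-first-failure draining.
    """
    provider = mailbox_name.split(":", 1)[0]
    return provider in SKIP_ON_FAILURE_PROVIDERS
```

The drain loop would then consult this before deciding whether a failure triggers `continue` (skip) or `return` (strict ordering).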

Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Missing flag prevents clearing the provider allowlist at runtime
    • Added FLAG_ALLOW_EMPTY to hybridcloud.webhookpayload.skip_on_failure_providers so the system options API accepts an empty list and operators can clear the allowlist at runtime.

Preview (03e0e902ae)
diff --git a/src/sentry/options/defaults.py b/src/sentry/options/defaults.py
--- a/src/sentry/options/defaults.py
+++ b/src/sentry/options/defaults.py
@@ -2505,7 +2505,7 @@
     "hybridcloud.webhookpayload.skip_on_failure_providers",
     type=Sequence,
     default=["github"],
-    flags=FLAG_AUTOMATOR_MODIFIABLE,
+    flags=FLAG_ALLOW_EMPTY | FLAG_AUTOMATOR_MODIFIABLE,
 )
 
 # Break glass controls
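The diff above matters because option flags are combined as a bitmask, and without `FLAG_ALLOW_EMPTY` the options API rejects an empty value, so operators could never clear the allowlist. A sketch of that mechanism under assumed flag values and an illustrative `validate` helper (the real registration machinery lives in Sentry's options module):

```python
# Illustrative bit values; only the bitmask-combination idea is the point.
FLAG_AUTOMATOR_MODIFIABLE = 1 << 0
FLAG_ALLOW_EMPTY = 1 << 1

def validate(value, flags: int):
    """Reject empty values unless the option opted in via FLAG_ALLOW_EMPTY."""
    if value in ([], "", None) and not (flags & FLAG_ALLOW_EMPTY):
        raise ValueError("empty value not permitted for this option")
    return value
```

With only `FLAG_AUTOMATOR_MODIFIABLE`, setting the option to `[]` raises; OR-ing in `FLAG_ALLOW_EMPTY` makes the empty list a legal runtime value.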

…drain

Previously, drain_mailbox stopped processing the entire mailbox when any
single message failed, blocking all subsequent webhooks until the next
scheduled retry cycle (up to 3 minutes). This caused the p50 delivery
latency to be dominated by transient failures in unrelated messages.

Now drain_mailbox skips failed messages and continues delivering the
remaining messages in the mailbox, matching the behavior already present
in the parallel drain path. Failed messages are individually rescheduled
by the existing schedule_next_attempt() mechanism.

The current_id variable tracks progress through the mailbox so that
failed records are not re-queried within the same drain invocation.
…havior

The test expected drain_mailbox to stop after the first timeout failure.
With the head-of-line blocking fix, all messages are now attempted, so
update the response call count assertion to match the new behavior.
The skip-failed-messages behavior is only safe for providers that handle
out-of-order delivery gracefully. Add a runtime-configurable option
(hybridcloud.webhookpayload.skip_on_failure_providers) to control which
providers opt in. github is enabled by default; all other providers
retain strict stop-on-first-failure ordering.
@tnt-sentry tnt-sentry force-pushed the fix/drain-mailbox-head-of-line-blocking branch from 0a0a6fc to 415df3d Compare March 10, 2026 15:49
@tnt-sentry tnt-sentry enabled auto-merge (squash) March 10, 2026 15:49
Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

# Fetch records from the batch in slices of 100. This avoids reading
# redundant data should we hit an error and should help keep query duration low.
query = WebhookPayloadReplica.objects.filter(
id__gte=payload.id, mailbox_name=payload.mailbox_name
Contributor


Deadline log omits new failed counter

Low Severity

The deliver_webhook.delivery_deadline log includes delivered but not the new failed counter, while the sibling delivery_complete_with_failures log does include it. Previously this didn't matter because any failure caused an immediate return, so the deadline path could never have failed > 0. With the new skip-on-failure behavior for allowlisted providers, the drain loop can now accumulate failures and hit the deadline, making this an observability gap for operators trying to understand drain behavior.
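The fix Bugbot suggests is small: emit the same fields on both log paths. A sketch assuming the field names from the comment (`delivered`, `failed`, `mailbox_name`); the helper is illustrative, not Sentry's actual code:

```python
import logging

logger = logging.getLogger("sentry.deliver_webhooks")

def deadline_log_fields(mailbox_name: str, delivered: int, failed: int) -> dict:
    # Mirror the fields of delivery_complete_with_failures so the
    # deadline exit path reports failures too.
    return {"mailbox_name": mailbox_name, "delivered": delivered, "failed": failed}

def log_delivery_deadline(mailbox_name: str, delivered: int, failed: int) -> None:
    logger.info(
        "deliver_webhook.delivery_deadline",
        extra=deadline_log_fields(mailbox_name, delivered, failed),
    )
```

Keeping both log events structurally identical lets operators diff the two paths directly when investigating drain behavior.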

Additional Locations (1)

Comment on lines 313 to 323
).order_by("id")

batch_count = 0
for record in query[:100]:
batch_count += 1
# Advance past this record regardless of outcome so that failed
# messages are not re-attempted in subsequent batches of this drain.
current_id = record.id + 1
# Refresh the lock on each delivery so a slow HTTP response in the
# inner loop (up to 30s timeout × 100 records) cannot outlast the
# 15s TTL and let the key expire mid-batch.
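The comment in the snippet above describes a classic TTL-versus-work-duration hazard: a 15s lock TTL can expire during a single 30s delivery unless it is refreshed per iteration. A minimal in-process sketch of the idea (the real code would refresh a distributed lock, e.g. in Redis; this class is purely illustrative):

```python
import time

class TTLLock:
    """Toy lock whose hold expires after `ttl` seconds unless refreshed."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.expires_at = time.monotonic() + ttl

    def refresh(self) -> None:
        # Called once per delivery, before the slow HTTP request,
        # so the hold always outlasts the next unit of work.
        self.expires_at = time.monotonic() + self.ttl

    def held(self) -> bool:
        return time.monotonic() < self.expires_at
```

Refreshing before each record, rather than once per batch, bounds the required TTL by one delivery's worst case instead of the whole batch's.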
Contributor


Bug: When a webhook at the head of a mailbox fails and enters backoff, schedule_webhook_delivery and maybe_trigger_drain will not process the mailbox, blocking all subsequent ready messages.
Severity: HIGH

Suggested Fix

Modify schedule_webhook_delivery and maybe_trigger_drain to handle mailboxes where the head is in backoff for skip_on_failure providers. Instead of only checking the head message, the logic should check for any ready message in the mailbox. This would allow the drain process to be triggered, which can then skip the backed-off head message and process the subsequent ready ones.
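The suggested fix can be sketched as a readiness predicate: for strict-ordering providers only the head message matters, while skip-on-failure providers should drain whenever any message is ready. Record shape and function name are assumptions for illustration, not the real scheduler API:

```python
from datetime import datetime, timedelta, timezone

def mailbox_has_ready_message(records: list, now: datetime, skip_on_failure: bool) -> bool:
    """Decide whether a mailbox should be drained.

    `records` are the mailbox's messages in ID order, each with a
    `schedule_for` datetime.
    """
    if not records:
        return False
    if not skip_on_failure:
        # Strict ordering: a backed-off head legitimately blocks the rest.
        return records[0]["schedule_for"] <= now
    # Skip-on-failure providers: a backed-off head should not block
    # ready messages behind it between drain invocations.
    return any(r["schedule_for"] <= now for r in records)
```

This closes the gap the reviewer describes: the drain gets triggered, and the drain loop itself then skips the backed-off head.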

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/sentry/hybridcloud/tasks/deliver_webhooks.py#L309-L323

Potential issue: For providers with `skip_on_failure=True`, if the message at the head
of a mailbox fails, it is rescheduled for a future attempt. However, both the scheduler
(`schedule_webhook_delivery`) and the push-trigger (`maybe_trigger_drain`) only check if
the head message is ready (`schedule_for <= now()`) before processing a mailbox. If the
head message is in a backoff period, the entire mailbox is skipped. This introduces a
new form of head-of-line blocking between drain invocations, causing all subsequent,
ready-to-deliver messages in that mailbox to be stuck until the head's backoff period
expires, which can be up to 60 minutes.


@tnt-sentry tnt-sentry merged commit 65488d2 into master Mar 10, 2026
54 of 55 checks passed
@tnt-sentry tnt-sentry deleted the fix/drain-mailbox-head-of-line-blocking branch March 10, 2026 16:14
