fix(webhooks): Eliminate head-of-line blocking in sequential mailbox drain #110215

tnt-sentry merged 3 commits into master
Conversation
Backend Test Failures
```python
# Continue processing remaining messages instead of stopping.
# Failed messages have already been rescheduled by deliver_message.
continue
```
While GitHub webhooks can generally be handled out of order, that isn't true of integrations like Jira, where ordering matters more because we can't make the changes idempotent. Generally these integrations are lower volume and don't qualify for parallel delivery today. With this change, though, those integrations could start seeing webhooks handled out of order.
Should we also track the integration providers that are hitting this path to validate that only idempotent integrations show up here? Or perhaps we have different behavior for different integrations?
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Missing flag prevents clearing the provider allowlist at runtime
- Added `FLAG_ALLOW_EMPTY` to `hybridcloud.webhookpayload.skip_on_failure_providers` so the system options API accepts an empty list and operators can clear the allowlist at runtime.
Or push these changes by commenting:
@cursor push 03e0e902ae
Preview (03e0e902ae)
```diff
diff --git a/src/sentry/options/defaults.py b/src/sentry/options/defaults.py
--- a/src/sentry/options/defaults.py
+++ b/src/sentry/options/defaults.py
@@ -2505,7 +2505,7 @@
     "hybridcloud.webhookpayload.skip_on_failure_providers",
     type=Sequence,
     default=["github"],
-    flags=FLAG_AUTOMATOR_MODIFIABLE,
+    flags=FLAG_ALLOW_EMPTY | FLAG_AUTOMATOR_MODIFIABLE,
 )
 # Break glass controls
```

…drain

Previously, drain_mailbox stopped processing the entire mailbox when any single message failed, blocking all subsequent webhooks until the next scheduled retry cycle (up to 3 minutes). This caused the p50 delivery latency to be dominated by transient failures in unrelated messages. Now drain_mailbox skips failed messages and continues delivering the remaining messages in the mailbox, matching the behavior already present in the parallel drain path. Failed messages are individually rescheduled by the existing schedule_next_attempt() mechanism. The current_id variable tracks progress through the mailbox so that failed records are not re-queried within the same drain invocation.
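The drain behavior this commit describes can be sketched as follows. This is a simplified illustration, not the real task: `deliver_message` is passed in as a stand-in for Sentry's delivery helper, and rescheduling of failures is assumed to happen inside it, as the commit message states.

```python
# Simplified sketch of the skip-on-failure drain loop.
def drain_mailbox(records, deliver_message):
    current_id = 0
    delivered = failed = 0
    for record in records:
        # Advance past this record regardless of outcome so failed
        # messages are not re-queried within this drain invocation.
        current_id = record["id"] + 1
        try:
            deliver_message(record)
            delivered += 1
        except Exception:
            # The failed record has already been rescheduled for a
            # future cycle; keep draining instead of returning.
            failed += 1
            continue
    return current_id, delivered, failed
```

A transient failure in the middle of the mailbox no longer blocks the records behind it; the loop records the failure and moves on.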
…havior

The test expected drain_mailbox to stop after the first timeout failure. With the head-of-line blocking fix, all messages are now attempted, so update the response call count assertion to match the new behavior.
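A minimal sketch of the updated test expectation, with hypothetical names (this is not the real Sentry test; `drain` and `flaky_send` are invented for illustration):

```python
# Before the fix, a timeout on the first message stopped the drain and the
# later messages were never attempted; now every message gets a call.
def drain(messages, send):
    attempted = 0
    for msg in messages:
        attempted += 1
        try:
            send(msg)
        except TimeoutError:
            continue  # previously: return
    return attempted

calls = []
def flaky_send(msg):
    calls.append(msg)
    if msg == "m1":
        raise TimeoutError

assert drain(["m1", "m2", "m3"], flaky_send) == 3
assert calls == ["m1", "m2", "m3"]
```

The old assertion would have expected a call count of 1; the new behavior attempts all three.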
The skip-failed-messages behavior is only safe for providers that handle out-of-order delivery gracefully. Add a runtime-configurable option (`hybridcloud.webhookpayload.skip_on_failure_providers`) to control which providers opt in. `github` is enabled by default; all other providers retain strict stop-on-first-failure ordering.
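The per-provider gate described above might look like the following sketch. The runtime option lookup is simplified to a module constant, and `on_delivery_failure` is an illustrative helper, not the real code; only the option's default value comes from the PR.

```python
# Default value of hybridcloud.webhookpayload.skip_on_failure_providers.
SKIP_ON_FAILURE_PROVIDERS = ["github"]

def on_delivery_failure(provider: str) -> str:
    """Decide what the drain loop does after a failed delivery."""
    if provider in SKIP_ON_FAILURE_PROVIDERS:
        return "skip"  # reschedule this message, keep draining the mailbox
    return "stop"      # preserve strict ordering: halt the drain here

assert on_delivery_failure("github") == "skip"
assert on_delivery_failure("jira") == "stop"
```

This keeps the reviewer's concern addressed: order-sensitive integrations such as Jira stay on the strict path unless an operator explicitly adds them to the allowlist.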
Force-pushed from 0a0a6fc to 415df3d
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```python
# Fetch records from the batch in slices of 100. This avoids reading
# redundant data should we hit an error and should help keep query duration low.
query = WebhookPayloadReplica.filter(
    id__gte=payload.id, mailbox_name=payload.mailbox_name
```
Deadline log omits new failed counter
Low Severity
The `deliver_webhook.delivery_deadline` log includes `delivered` but not the new `failed` counter, while the sibling `delivery_complete_with_failures` log does include it. Previously this didn't matter because any failure caused an immediate return, so the deadline path could never have `failed > 0`. With the new skip-on-failure behavior for allowlisted providers, the drain loop can now accumulate failures and hit the deadline, making this an observability gap for operators trying to understand drain behavior.
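One possible shape of the fix, assuming the log event names from the report; the helper function and the exact field names in `extra` are illustrative, not the real Sentry code:

```python
import logging

logger = logging.getLogger("sentry.deliver_webhooks")

def log_deadline(mailbox_name: str, delivered: int, failed: int) -> dict:
    # Include the failed counter on the deadline path so it matches the
    # sibling delivery_complete_with_failures log.
    extra = {
        "mailbox_name": mailbox_name,
        "delivered": delivered,
        "failed": failed,  # previously omitted on the deadline path
    }
    logger.info("deliver_webhook.delivery_deadline", extra=extra)
    return extra
```

With both counters present, operators can tell a deadline hit after many failures apart from one hit after many slow successes.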
```python
).order_by("id")

batch_count = 0
for record in query[:100]:
    batch_count += 1
    # Advance past this record regardless of outcome so that failed
    # messages are not re-attempted in subsequent batches of this drain.
    current_id = record.id + 1
    # Refresh the lock on each delivery so a slow HTTP response in the
    # inner loop (up to 30s timeout × 100 records) cannot outlast the
    # 15s TTL and let the key expire mid-batch.
```
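The refresh pattern in the comment above can be illustrated with a toy lease. The `Lease` class is hypothetical; the 15s TTL and 30s-per-delivery figures come from the comment, and a real implementation would refresh a shared key (e.g. in Redis) rather than a local timestamp:

```python
import time

class Lease:
    """In-process stand-in for a TTL-based lock."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.expires_at = time.monotonic() + ttl

    def refresh(self) -> None:
        # Re-arm the expiry; in production this would re-set the key's TTL.
        self.expires_at = time.monotonic() + self.ttl

    def held(self) -> bool:
        return time.monotonic() < self.expires_at

lease = Lease(ttl=15.0)
for record in range(3):  # stand-in for the batch of up to 100 records
    lease.refresh()      # without this, 100 x 30s deliveries outlast 15s
    assert lease.held()
```

Refreshing once per record bounds the exposure to a single delivery's duration instead of the whole batch's.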
Bug: When a webhook at the head of a mailbox fails and enters backoff, schedule_webhook_delivery and maybe_trigger_drain will not process the mailbox, blocking all subsequent ready messages.
Severity: HIGH
Suggested Fix
Modify schedule_webhook_delivery and maybe_trigger_drain to handle mailboxes where the head is in backoff for skip_on_failure providers. Instead of only checking the head message, the logic should check for any ready message in the mailbox. This would allow the drain process to be triggered, which can then skip the backed-off head message and process the subsequent ready ones.
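The suggested fix can be sketched as follows. The record shape and the `should_drain` helper are illustrative (the real scheduler queries the database rather than a list), but the decision logic matches the suggestion: strict providers look only at the head, skip-on-failure providers look for any ready message.

```python
from datetime import datetime, timedelta

def should_drain(records, skip_on_failure: bool, now: datetime) -> bool:
    """Decide whether a mailbox should be drained right now."""
    if not records:
        return False
    if not skip_on_failure:
        # Strict ordering: only a ready head message triggers a drain.
        return records[0]["schedule_for"] <= now
    # Skip-on-failure providers: any ready message should trigger a drain,
    # even when the head is in backoff.
    return any(r["schedule_for"] <= now for r in records)

now = datetime(2024, 1, 1, 12, 0)
records = [
    {"id": 1, "schedule_for": now + timedelta(minutes=30)},  # head in backoff
    {"id": 2, "schedule_for": now - timedelta(seconds=5)},   # ready
]
assert should_drain(records, skip_on_failure=False, now=now) is False
assert should_drain(records, skip_on_failure=True, now=now) is True
```

With the head-only check, the ready message at id 2 would wait out the head's full backoff (up to 60 minutes per the report); the any-ready check lets the drain start and skip the backed-off head.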
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/sentry/hybridcloud/tasks/deliver_webhooks.py#L309-L323
Potential issue: For providers with `skip_on_failure=True`, if the message at the head
of a mailbox fails, it is rescheduled for a future attempt. However, both the scheduler
(`schedule_webhook_delivery`) and the push-trigger (`maybe_trigger_drain`) only check if
the head message is ready (`schedule_for <= now()`) before processing a mailbox. If the
head message is in a backoff period, the entire mailbox is skipped. This introduces a
new form of head-of-line blocking between drain invocations, causing all subsequent,
ready-to-deliver messages in that mailbox to be stuck until the head's backoff period
expires, which can be up to 60 minutes.
Previously, `drain_mailbox` returned immediately when any single webhook delivery failed, blocking all subsequent messages in that mailbox until the next scheduled retry cycle (up to 3 minutes away). A single transient 500 or timeout could delay dozens of unrelated, healthy webhooks.

This changes `drain_mailbox` to skip failed messages and continue delivering the remaining messages in the mailbox, matching the behavior already present in the parallel drain path (`drain_mailbox_parallel`). Failed messages are individually rescheduled via the existing `schedule_next_attempt()` mechanism, so they will be retried on a future cycle.

A `current_id` variable now tracks the highest ID processed so far, ensuring that failed records (which stay in the database with a future `schedule_for`) are not re-fetched within the same drain invocation.

The ordering semantics change from strict to best-effort, which is safe because: