fix(preprod): correctly attribute rate limiting errors#106580
fix(preprod): correctly attribute rate limiting errors#106580
Conversation
| except ApiForbiddenError as e: | ||
| lifecycle.record_failure(e) | ||
| error_message = str(e).lower() | ||
| # Github uses 403 codes for some rate limiting errors which are captured as ApiForbiddenError |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| namespace=preprod_tasks, | ||
| processing_deadline_duration=30, | ||
| retry=Retry(times=3, delay=60, ignore=(IntegrationConfigurationError,)), | ||
| retry=Retry(times=3, delay=60, on=(ApiRateLimitedError,)), |
There was a problem hiding this comment.
Bug: The change from ignore to on in the Retry configuration prevents retries on transient network errors like timeouts, allowing retries only for ApiRateLimitedError.
Severity: HIGH
Suggested Fix
Add common transient API errors to the on tuple in the Retry configuration. For example: retry=Retry(times=3, delay=60, on=(ApiRateLimitedError, ApiTimeoutError, ApiHostError, ApiConnectionResetError, ApiRetryError)). This will restore the previous behavior of retrying on temporary network failures.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/sentry/preprod/vcs/status_checks/size/tasks.py#L72
Potential issue: The task's retry configuration was changed from a denylist approach
using `ignore=(IntegrationConfigurationError,)` to an allowlist approach with
`on=(ApiRateLimitedError,)`. This change means that transient network errors such as
`ApiTimeoutError`, `ApiHostError`, and `ApiConnectionResetError` will no longer trigger
a retry. Previously, these errors would be retried. With the new code, any temporary
network issue will cause the task to fail immediately and permanently, which could lead
to missed status checks in production.
Did we get this right? 👍 / 👎 to inform future reviews.
Ultimately doesn't change any behavior since we were still retrying off the old error type, but should let us better attribute how often rate limiting is happening. In our backend logs you can see these come back as 403 which triggers some ApiForbidden error logic in our base integrations code: