fix(preprod): Eliminate race condition in snapshot status check posting #115650
| "caller": "upload_completion", | ||
| }, | ||
| ) | ||
| create_preprod_snapshot_pr_comment_task.apply_async( |
nit: Any reason these are different tasks? I thought we would have a similar update_vcs task that would do both the status check + comment in one go.
yeah that's a great idea. let me take a stab at that and I'll try to incorporate it into this PR before merging
| "Snapshot comparison artifact not found", | ||
| extra={"head_artifact_id": head_artifact_id, "base_artifact_id": base_artifact_id}, | ||
| ) | ||
| create_preprod_snapshot_status_check_task.apply_async( |
Were these just spots you noticed were missing checks, unrelated to the issue here?
Also somewhat highlights what I mean about combining the VCS tasks.
> Were these just spots you noticed were missing checks, unrelated to the issue here?
precisely
Consolidate status check ownership so compare_snapshots owns the full IN_PROGRESS → SUCCESS/FAILURE lifecycle when a comparison is starting. The upload endpoint no longer fires its own status check in that case. Also add a staleness guard in post_snapshot_status_check_task: before posting IN_PROGRESS, re-check the DB and skip if the comparison has already reached a terminal state. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…pshots The PreprodArtifact.DoesNotExist and PreprodSnapshotMetrics.DoesNotExist handlers returned without posting a terminal status check. Since the upload endpoint now skips its own status check when a comparison is starting, these paths left the GitHub check stuck at "Processing." Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…are_snapshots The artifact-not-found and metrics-not-found exits were posting the status check but not the PR comment, leaving it stale. Co-Authored-By: Claude <noreply@anthropic.com>
…ot_vcs Extract a single entry point for dispatching both the status check and PR comment tasks. All 10 paired call sites across the codebase now go through this function, eliminating duplicated dispatch blocks and ensuring consistent behavior when the kwargs or dispatch pattern change. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
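As a rough illustration of the single entry point described in that commit, a minimal sketch might look like the following. The parameter names `preprod_artifact_id`, `caller`, and `update_pr_comment` come from the diff in this PR; the default value, the kwargs shape, and the `FakeTask` stand-in (used here instead of real Celery tasks so the snippet runs on its own) are assumptions, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeTask:
    """Hypothetical stand-in for a Celery task: records calls instead of enqueueing."""

    name: str
    calls: list[dict[str, Any]] = field(default_factory=list)

    def apply_async(self, kwargs: dict[str, Any]) -> None:
        self.calls.append(kwargs)


# Stand-ins for the two real tasks named in this PR.
create_preprod_snapshot_status_check_task = FakeTask("status_check")
create_preprod_snapshot_pr_comment_task = FakeTask("pr_comment")


def update_preprod_snapshot_vcs(
    preprod_artifact_id: int,
    caller: str,
    update_pr_comment: bool = True,  # assumed default
) -> None:
    """Dispatch the status check and, unless disabled, the PR comment."""
    kwargs = {"preprod_artifact_id": preprod_artifact_id, "caller": caller}
    create_preprod_snapshot_status_check_task.apply_async(kwargs=kwargs)
    if update_pr_comment:
        # Copy so the two recorded payloads cannot alias each other.
        create_preprod_snapshot_pr_comment_task.apply_async(kwargs=dict(kwargs))
```

With one function owning the dispatch, a call site that only wants the status check (such as the start of a comparison) passes `update_pr_comment=False` rather than duplicating the dispatch block.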
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e3ef59a.
…in compare_snapshots
| ) | ||
| | ||
| update_preprod_snapshot_vcs( | ||
| preprod_artifact_id=head_artifact_id, | ||
| caller="compare_start", | ||
| update_pr_comment=False, | ||
| ) | ||
| | ||
| try: | ||
| head_artifact = PreprodArtifact.objects.select_related("project__organization").get( | ||
| id=head_artifact_id, |
Bug: A retried compare_snapshots task can leave a GitHub status check permanently stuck in an in-progress state if the comparison was already processing.
Severity: MEDIUM
Suggested Fix
Update the compare_snapshots task to handle the PROCESSING state. When a retried task finds a comparison in the PROCESSING state, it should avoid posting a new IN_PROGRESS status and exit, or it should be responsible for eventually posting a terminal status. Alternatively, the staleness guard could be updated to include PROCESSING as a state that prevents a new IN_PROGRESS status from being posted.
Location: src/sentry/preprod/snapshots/tasks.py#L349-L360
Potential issue: If a `compare_snapshots` task is retried for a comparison that is
already in the `PROCESSING` state, for instance, due to a prior worker being killed, it
posts a new `IN_PROGRESS` status check to GitHub. However, the task then exits early
without posting a terminal status. The staleness guard in
`post_snapshot_status_check_task` does not account for the `PROCESSING` state, only
`SUCCESS` or `FAILED`. This results in the GitHub check becoming permanently stuck in a
processing state, as no terminal status is ever posted.
Prevents posting an orphaned IN_PROGRESS status when a retried task finds the comparison already being processed by another worker.
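One way to picture that change is a toy version of the task in which the IN_PROGRESS post happens only after the task has confirmed it is the one starting the comparison. This is a sketch under assumptions: the `ComparisonState` enum, the dict standing in for the DB row, and the list standing in for GitHub posts are all hypothetical, and the real task may order these steps differently.

```python
from enum import Enum


class ComparisonState(Enum):
    # Hypothetical mirror of the comparison states mentioned in this PR.
    PENDING = "pending"
    PROCESSING = "processing"
    SUCCESS = "success"
    FAILED = "failed"


def compare_snapshots(comparison: dict, posted: list) -> None:
    """Toy task: `comparison` stands in for the DB row, `posted` for GitHub posts."""
    if comparison["state"] is not ComparisonState.PENDING:
        # Retry case: another worker already claimed (or finished) this
        # comparison, so posting IN_PROGRESS here would be orphaned.
        return
    comparison["state"] = ComparisonState.PROCESSING
    posted.append("IN_PROGRESS")
    # ... the real comparison work would happen here ...
    comparison["state"] = ComparisonState.SUCCESS
    posted.append("SUCCESS")
```

In this model a retried run that finds the row in PROCESSING posts nothing, while a fresh run posts IN_PROGRESS followed by a terminal status.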

Summary
Fixes a race condition where a snapshot GitHub status check gets permanently stuck at "Processing."
Root cause: When a snapshot is uploaded and a base exists, two independent paths fire status check posts:
- `create_preprod_snapshot_status_check_task` (computes IN_PROGRESS)
- `compare_snapshots`, which on completion enqueues another status check post (computes SUCCESS/FAILURE)

Since `post_snapshot_status_check_task` always creates a new GitHub check run (POST, not PATCH), Celery task ordering determines which one GitHub shows last. If the IN_PROGRESS post lands after the COMPLETED post, the check reverts to "Processing" permanently.

Changes

1. Structural fix: consolidate status check ownership (`preprod_artifact_snapshot.py`)

When a comparison is starting, the upload endpoint no longer fires its own status check. Instead, `compare_snapshots` owns the full lifecycle: it posts IN_PROGRESS at the start and SUCCESS/FAILURE on completion. The upload endpoint only posts a status check when no comparison will happen (no base artifact, first upload, etc.).

2. Handle early failures in `compare_snapshots` (`snapshots/tasks.py`)

- Post IN_PROGRESS at the start of `compare_snapshots`, before any DB lookups, so the check is posted even if the artifact/metrics lookup fails.
- Post a terminal status on the early-exit error paths (`JSONDecodeError`, `RequestError`, `ValidationError`, `TypeError`), which previously returned silently, leaving the check stuck.

3. Staleness guard as safety net (`status_checks/snapshots/tasks.py`)

Belt-and-suspenders: before `post_snapshot_status_check_task` posts an IN_PROGRESS status, it re-queries the DB to check whether the artifact's comparison has already reached a terminal state (SUCCESS or FAILED). If so, the post is skipped; the comparison result is authoritative.

Test plan
- `test_skips_stale_in_progress_when_comparison_succeeded`: verifies the guard skips posting when a completed comparison exists
- `test_allows_in_progress_when_comparison_still_pending`: verifies legitimate IN_PROGRESS posts still go through
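
The race and the staleness guard can be modeled in a few lines. This is a simplified sketch, not the code in this PR: `FakeGitHub`, `post_snapshot_status_check`, and the `comparison_state` argument (standing in for the DB re-query) are hypothetical names. The only behavior taken from the description above is that every post creates a new check run, so the most recent post is what the PR shows, and that the guard skips an IN_PROGRESS post once the comparison is SUCCESS or FAILED.

```python
from dataclasses import dataclass, field
from typing import Optional

# Comparison states treated as terminal by the guard, per the PR description.
TERMINAL_STATES = {"SUCCESS", "FAILED"}


@dataclass
class FakeGitHub:
    """Models POST-only check runs: each post appends, and the newest one is displayed."""

    check_runs: list[str] = field(default_factory=list)

    def post(self, status: str) -> None:
        self.check_runs.append(status)

    @property
    def displayed(self) -> Optional[str]:
        return self.check_runs[-1] if self.check_runs else None


def post_snapshot_status_check(
    github: FakeGitHub,
    status: str,
    comparison_state: Optional[str],  # stands in for re-querying the DB
    guard: bool = True,
) -> bool:
    """Post `status`; return False if the staleness guard skipped it."""
    if guard and status == "IN_PROGRESS" and comparison_state in TERMINAL_STATES:
        # The comparison already finished, so its result is authoritative.
        return False
    github.post(status)
    return True
```

Posting SUCCESS and then a late IN_PROGRESS with `guard=False` leaves `displayed` at `"IN_PROGRESS"`, which is the stuck check this PR fixes; with the guard enabled the late post is skipped and `displayed` stays `"SUCCESS"`.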