Skip to content

fix(preprod): Eliminate race condition in snapshot status check posting#115650

Merged
NicoHinderling merged 8 commits into
masterfrom
nico/fix-snapshot-status-check-race
May 19, 2026
Merged

fix(preprod): Eliminate race condition in snapshot status check posting#115650
NicoHinderling merged 8 commits into
masterfrom
nico/fix-snapshot-status-check-race

Conversation

@NicoHinderling
Copy link
Copy Markdown
Contributor

@NicoHinderling NicoHinderling commented May 15, 2026

Summary

Fixes a race condition where a snapshot GitHub status check gets permanently stuck at "Processing."

Root cause: When a snapshot is uploaded and a base exists, two independent paths fire status check posts:

  1. The upload endpoint enqueues create_preprod_snapshot_status_check_task (computes IN_PROGRESS)
  2. The upload endpoint also enqueues compare_snapshots, which on completion enqueues another status check post (computes SUCCESS/FAILURE)

Since post_snapshot_status_check_task always creates a new GitHub check run (POST, not PATCH), Celery task ordering determines which one GitHub shows last. If the IN_PROGRESS post lands after the COMPLETED post, the check reverts to "Processing" permanently.

Changes

1. Structural fix — consolidate status check ownership (preprod_artifact_snapshot.py)

When a comparison is starting, the upload endpoint no longer fires its own status check. Instead, compare_snapshots owns the full lifecycle: it posts IN_PROGRESS at the start and SUCCESS/FAILURE on completion. The upload endpoint only posts a status check when no comparison will happen (no base artifact, first upload, etc.).

2. Handle early failures in compare_snapshots (snapshots/tasks.py)

  • Post the IN_PROGRESS status check at the very top of compare_snapshots, before any DB lookups, so even if artifact/metrics lookup fails the check is posted.
  • Add status check + PR comment posts to the manifest parse error handler (JSONDecodeError, RequestError, ValidationError, TypeError), which previously returned silently leaving the check stuck.

3. Staleness guard as safety net (status_checks/snapshots/tasks.py)

Belt-and-suspenders: before post_snapshot_status_check_task posts an IN_PROGRESS status, it re-queries the DB to check if the artifact's comparison has already reached a terminal state (SUCCESS or FAILED). If so, the post is skipped — the comparison result is authoritative.

Test plan

  • test_skips_stale_in_progress_when_comparison_succeeded — verifies guard skips posting when a completed comparison exists
  • test_allows_in_progress_when_comparison_still_pending — verifies legitimate IN_PROGRESS posts still go through
  • All existing tests pass (32 total in test file)

@NicoHinderling NicoHinderling requested a review from a team as a code owner May 15, 2026 17:25
@github-actions github-actions Bot added the Scope: Backend Automatically applied to PRs that change backend components label May 15, 2026
@NicoHinderling NicoHinderling force-pushed the nico/fix-snapshot-status-check-race branch from 6dabc66 to 3571d94 Compare May 15, 2026 18:13
Comment thread src/sentry/preprod/snapshots/tasks.py Outdated
Comment thread src/sentry/preprod/snapshots/tasks.py Outdated
@NicoHinderling NicoHinderling changed the title fix(preprod): Prevent stale IN_PROGRESS status check from overwriting completed result fix(preprod): Eliminate race condition in snapshot status check posting May 15, 2026
@NicoHinderling NicoHinderling force-pushed the nico/fix-snapshot-status-check-race branch from 45be3cd to 152be68 Compare May 15, 2026 18:50
Comment thread src/sentry/preprod/snapshots/tasks.py Outdated
Comment thread src/sentry/preprod/snapshots/tasks.py Outdated
"caller": "upload_completion",
},
)
create_preprod_snapshot_pr_comment_task.apply_async(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Any reason these are different tasks? I thought we would have a similar update_vcs task that would do both the status check + comment in one go.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's a great idea. let me take a stab at that and ill try to incorporate it into this PR before merging

Comment thread src/sentry/preprod/snapshots/tasks.py Outdated
"Snapshot comparison artifact not found",
extra={"head_artifact_id": head_artifact_id, "base_artifact_id": base_artifact_id},
)
create_preprod_snapshot_status_check_task.apply_async(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were these just spots you noticed were missing checks, unrelated to the issue here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also somewhat highlights what I mean about combining the VCS tasks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were these just spots you noticed were missing checks, unrelated to the issue here?

precisely

Comment thread src/sentry/preprod/snapshots/tasks.py
NicoHinderling and others added 5 commits May 19, 2026 11:06
Consolidate status check ownership so compare_snapshots owns the full
IN_PROGRESS → SUCCESS/FAILURE lifecycle when a comparison is starting.
The upload endpoint no longer fires its own status check in that case.

Also add a staleness guard in post_snapshot_status_check_task: before
posting IN_PROGRESS, re-check the DB and skip if the comparison has
already reached a terminal state.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…pshots

The PreprodArtifact.DoesNotExist and PreprodSnapshotMetrics.DoesNotExist
handlers returned without posting a terminal status check. Since the
upload endpoint now skips its own status check when a comparison is
starting, these paths left the GitHub check stuck at "Processing."

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…are_snapshots

The artifact-not-found and metrics-not-found exits were posting the
status check but not the PR comment, leaving it stale.

Co-Authored-By: Claude <noreply@anthropic.com>
…ot_vcs

Extract a single entry point for dispatching both the status check and
PR comment tasks. All 10 paired call sites across the codebase now go
through this function, eliminating duplicated dispatch blocks and
ensuring consistent behavior when the kwargs or dispatch pattern change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@NicoHinderling NicoHinderling force-pushed the nico/fix-snapshot-status-check-race branch from 5149ef5 to d9e69e5 Compare May 19, 2026 18:07
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit e3ef59a. Configure here.

Comment thread src/sentry/preprod/snapshots/tasks.py
Comment on lines 350 to 360
)

update_preprod_snapshot_vcs(
preprod_artifact_id=head_artifact_id,
caller="compare_start",
update_pr_comment=False,
)

try:
head_artifact = PreprodArtifact.objects.select_related("project__organization").get(
id=head_artifact_id,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: A retried compare_snapshots task can leave a GitHub status check permanently stuck in an in-progress state if the comparison was already processing.
Severity: MEDIUM

Suggested Fix

Update the compare_snapshots task to handle the PROCESSING state. When a retried task finds a comparison in the PROCESSING state, it should avoid posting a new IN_PROGRESS status and exit, or it should be responsible for eventually posting a terminal status. Alternatively, the staleness guard could be updated to include PROCESSING as a state that prevents a new IN_PROGRESS status from being posted.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/sentry/preprod/snapshots/tasks.py#L349-L360

Potential issue: If a `compare_snapshots` task is retried for a comparison that is
already in the `PROCESSING` state, for instance, due to a prior worker being killed, it
posts a new `IN_PROGRESS` status check to GitHub. However, the task then exits early
without posting a terminal status. The staleness guard in
`post_snapshot_status_check_task` does not account for the `PROCESSING` state, only
`SUCCESS` or `FAILED`. This results in the GitHub check becoming permanently stuck in a
processing state, as no terminal status is ever posted.

Prevents posting an orphaned IN_PROGRESS status when a retried task
finds the comparison already being processed by another worker.
@NicoHinderling NicoHinderling enabled auto-merge (squash) May 19, 2026 18:46
@NicoHinderling NicoHinderling merged commit 01120c7 into master May 19, 2026
62 checks passed
@NicoHinderling NicoHinderling deleted the nico/fix-snapshot-status-check-race branch May 19, 2026 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Backend Automatically applied to PRs that change backend components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants