fix(snapshots): Add instrumentation logging to snapshot download stream#116079

Merged
NicoHinderling merged 3 commits into master from nico/snapshot-download-logging
May 26, 2026

@NicoHinderling (Contributor)

Add per-batch and lifecycle logging to the streaming ZIP download endpoint
to diagnose why 40K-image snapshot downloads fail with HTTP/2 stream errors
after ~20-25 seconds.

The streaming generator currently has zero observability — the Django span
closes in ~1s (before the generator runs), and when the stream dies
externally, nothing is logged. We need to see:

  • How many batches/images are processed before the stream dies
  • Whether there are long gaps between yielded chunks (idle timeout?)
  • Fetch latency distribution from objectstore (slow reads?)
  • Memory usage over time (OOM kill?)
  • How the generator exits: completion, GeneratorExit (client/server disconnect), or unhandled exception
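
A rough sketch of how the yield-gap and fetch-latency measurements above could be collected. `percentile` and `YieldGapTracker` are hypothetical names for illustration, not the helpers used in this PR, and the percentile uses a simple nearest-rank rule.

```python
import math
import time


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of samples; 0.0 when there are no samples."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


class YieldGapTracker:
    """Remembers the longest interval between consecutive mark() calls."""

    def __init__(self, clock=time.monotonic):
        # clock is injectable so the tracker can be exercised without sleeping
        self._clock = clock
        self._last = clock()
        self.max_gap = 0.0

    def mark(self) -> None:
        now = self._clock()
        self.max_gap = max(self.max_gap, now - self._last)
        self._last = now
```

In a streaming generator, `mark()` would be called right before each `yield`, and each objectstore fetch duration would be appended to a per-batch list that feeds `percentile(latencies, 50)` and `percentile(latencies, 99)`. A large `max_gap` close to the failure time would point toward an idle timeout rather than slow reads.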

Log points added:

  • stream_start — confirms generator is running, PID, memory baseline
  • batch_complete — per batch: duration, cumulative progress, max yield gap, fetch latency p50/p99/max, RSS
  • stream_complete — successful finish with totals
  • stream_error — unhandled exception with traceback
  • stream_finally — always fires with exit reason and final stats
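
For reference, a minimal sketch of how these lifecycle log points can wrap a streaming generator. `instrumented_stream` and the `exit_reason` values are assumptions for illustration; the actual endpoint code is not shown in this conversation. The per-batch `batch_complete` log is left out to keep the sketch short.

```python
import logging
import os
from collections.abc import Iterable, Iterator

logger = logging.getLogger(__name__)


def instrumented_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield chunks unchanged while logging how the stream starts and ends."""
    exit_reason = "unknown"
    bytes_yielded = 0
    # Runs only once the response is iterated, not when the view returns.
    logger.info("stream_start", extra={"pid": os.getpid()})
    try:
        for chunk in chunks:
            bytes_yielded += len(chunk)
            yield chunk
        exit_reason = "complete"
        logger.info("stream_complete", extra={"bytes_yielded": bytes_yielded})
    except GeneratorExit:
        # Raised at the paused yield when the consumer closes the generator,
        # for example after a client or server disconnect.
        exit_reason = "generator_exit"
        raise
    except Exception:
        exit_reason = "exception"
        logger.exception("stream_error")  # includes the traceback
        raise
    finally:
        logger.info(
            "stream_finally",
            extra={"exit_reason": exit_reason, "bytes_yielded": bytes_yielded},
        )
```

Re-raising `GeneratorExit` keeps the normal close semantics; the `finally` block still runs, so `stream_finally` is emitted on every exit path.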

This is temporary instrumentation to be removed once the root cause is found.

Adds per-batch and lifecycle logging to diagnose HTTP/2 stream errors
on large (40K image) snapshot downloads. Tracks batch progress, fetch
latency percentiles, gaps between yields, memory usage, and how the
stream terminates (complete, client disconnect, or exception).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@NicoHinderling NicoHinderling requested a review from a team as a code owner May 22, 2026 06:35
@github-actions github-actions Bot added the Scope: Backend label (automatically applied to PRs that change backend components) May 22, 2026
@cursor cursor Bot (Contributor) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit b79a120.

Comment thread src/sentry/preprod/api/endpoints/snapshots/preprod_artifact_snapshot_download.py Outdated
Comment thread src/sentry/preprod/api/endpoints/snapshots/preprod_artifact_snapshot_download.py Outdated
Comment thread src/sentry/preprod/api/endpoints/snapshots/preprod_artifact_snapshot_download.py Outdated
- Use sys.platform to choose correct ru_maxrss divisor (KB on Linux,
  bytes on macOS)
- Move stream_finally logging after zf.close() and final buf.drain()
  so bytes_yielded includes the ZIP central directory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
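
The two fixes in this commit can be sketched together as follows. This is an illustration under assumptions, not the code from the PR: `maxrss_to_mb`, `DrainBuffer`, and `stream_zip` are hypothetical stand-ins for the endpoint's memory helper, `buf`, and generator. It relies on `resource` (Unix only) and on `zipfile` accepting an unseekable file object.

```python
import io
import logging
import resource
import sys
import zipfile
from collections.abc import Iterable, Iterator

logger = logging.getLogger(__name__)


def maxrss_to_mb(ru_maxrss: int, platform: str = sys.platform) -> float:
    """Convert ru_maxrss to MB: it is in bytes on macOS and KB on Linux."""
    divisor = 1024 * 1024 if platform == "darwin" else 1024
    return ru_maxrss / divisor


class DrainBuffer(io.RawIOBase):
    """Write-only, unseekable sink whose pending bytes can be drained."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks: list[bytes] = []

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        self._chunks.append(bytes(data))
        return len(self._chunks[-1])

    def drain(self) -> bytes:
        out = b"".join(self._chunks)
        self._chunks.clear()
        return out


def stream_zip(files: Iterable[tuple[str, bytes]]) -> Iterator[bytes]:
    buf = DrainBuffer()
    zf = zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_STORED)
    bytes_yielded = 0
    try:
        for name, data in files:
            zf.writestr(name, data)
            chunk = buf.drain()
            bytes_yielded += len(chunk)
            yield chunk
        # close() writes the central directory, so drain once more
        # before the final stats are logged.
        zf.close()
        tail = buf.drain()
        bytes_yielded += len(tail)
        yield tail
    finally:
        usage = resource.getrusage(resource.RUSAGE_SELF)
        logger.info(
            "stream_finally",
            extra={
                "bytes_yielded": bytes_yielded,
                "peak_rss_mb": maxrss_to_mb(usage.ru_maxrss),
            },
        )
```

Note that `ru_maxrss` is the peak resident set size of the process, so it can show growth across batches but never a decrease.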
@NicoHinderling NicoHinderling merged commit c4fcb9f into master May 26, 2026
62 checks passed
@NicoHinderling NicoHinderling deleted the nico/snapshot-download-logging branch May 26, 2026 18:15
2 participants