
feat(chunk-upload): Add zstd compression support for sourcemap uploads#113760

Merged
BYK merged 8 commits into master from byk/chunk-upload-zstd
Apr 24, 2026

Conversation


@BYK BYK commented Apr 22, 2026

Summary

Extends the chunk-upload endpoint used by sourcemap / artifact-bundle / DIF / preprod uploads to accept zstd-compressed chunks in addition to the existing gzip path, advertised to clients via the compression capability list in the GET response.

Protocol changes

  • New Content-Encoding detection path. When Content-Encoding: gzip or Content-Encoding: zstd is set on a chunk upload POST, every chunk delivered under the standard file multipart field is decoded using that codec.
  • Legacy file_gzip field still works for backwards compatibility with older sentry-cli / @sentry/cli versions that only know how to send chunks under that field name.
  • Ambiguous requests are rejected with HTTP 400 (Cannot combine Content-Encoding with file_gzip field). A single request cannot use both mechanisms.
  • Unsupported encodings return HTTP 400 (Unsupported Content-Encoding).
  • compression list extended to ["gzip", "zstd"]. No CHUNK_UPLOAD_ACCEPT entry is added — clients have always feature-detected compression via this list. The existing chunk-upload.no-compression org allowlist kill-switch continues to disable all compression codecs.
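The negotiation rules above can be sketched as a small decision function. This is an illustrative Python sketch, not Sentry's actual code: the function name and return shape are hypothetical, and the real endpoint returns DRF responses rather than tuples.

```python
def resolve_chunk_codec(content_encoding, has_file_gzip):
    # Hypothetical sketch of the negotiation rules described above.
    if content_encoding is not None and has_file_gzip:
        return ("error", "Cannot combine Content-Encoding with file_gzip field")
    if content_encoding is None:
        # No header: the legacy file_gzip field implies gzip, else plain bytes
        return ("ok", "gzip" if has_file_gzip else None)
    codec = content_encoding.strip().lower()  # header values are case-insensitive
    if codec in ("gzip", "zstd"):
        return ("ok", codec)
    return ("error", "Unsupported Content-Encoding")
```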

Security / safety

Decompression is streamed through zstandard.ZstdDecompressor().stream_reader(file) (and GzipFile for gzip) via a bounded-read helper that loops until EOF or a hard cap of settings.SENTRY_CHUNK_UPLOAD_BLOB_SIZE (8 MiB) on decoded output per chunk. The loop handles short reads from streaming decoders (important for ZstdDecompressor.stream_reader + file-backed TemporaryUploadedFile). Excess bytes raise ChunkTooLarge → 400. Peak memory stays bounded at limit + 1 bytes regardless of how well-compressed the input is, protecting against decompression-bomb inputs where a tiny compressed payload would expand to hundreds of megabytes.
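The bounded-read loop can be sketched roughly as follows. This is a minimal illustration using stdlib gzip only (the PR's helper also feeds zstd's stream_reader); names are hypothetical and the cap here is a test value, not SENTRY_CHUNK_UPLOAD_BLOB_SIZE.

```python
import gzip
import io

class ChunkTooLarge(Exception):
    """Decoded output exceeded the per-chunk cap."""

def read_bounded(reader, limit: int) -> bytes:
    # Loop until EOF or the cap is exceeded: streaming decoders
    # (e.g. zstd's stream_reader over a file-backed source) may
    # return fewer bytes than requested on any single read().
    buf = bytearray()
    while len(buf) <= limit:
        chunk = reader.read(limit + 1 - len(buf))
        if not chunk:
            return bytes(buf)  # EOF within the cap
        buf += chunk
    raise ChunkTooLarge  # peak memory stays at limit + 1 bytes

raw = b"a" * 1000
compressed = gzip.compress(raw)
assert read_bounded(gzip.GzipFile(fileobj=io.BytesIO(compressed)), 1000) == raw
```

Because the loop never requests more than `limit + 1 - len(buf)` bytes, a tiny compressed bomb cannot force a large allocation before the cap check fires.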

Invalid / malformed payloads are all converted to 400 (never 500):

  • zstandard.ZstdError → Invalid zstd payload
  • gzip.BadGzipFile (bad magic / header) → Invalid gzip payload
  • EOFError (truncated stream) → Invalid gzip payload
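The gzip half of this mapping can be demonstrated with the stdlib alone. A hypothetical sketch (the real endpoint returns DRF 400 responses, not dicts, and also catches zstandard.ZstdError):

```python
import gzip
import io

def decode_gzip_chunk(raw: bytes):
    # Map decoder failures to 400-style error bodies instead of letting
    # them propagate as a 500.
    try:
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    except gzip.BadGzipFile:  # bad magic bytes / bad header
        return {"error": "Invalid gzip payload"}
    except EOFError:  # stream truncated mid-decompression
        return {"error": "Invalid gzip payload"}

assert decode_gzip_chunk(gzip.compress(b"hello")) == b"hello"
```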

End-to-end verified

Ran the full request pipeline (middleware + auth + DRF + endpoint) against the live getsentry devservices stack on a fresh Coder VM and confirmed:

  • GET options endpoint advertises "compression": ["gzip", "zstd"].
  • POST with Content-Encoding: zstd + a zstd-compressed chunk → 200 + FileBlob persisted.
  • POST with Content-Encoding: gzip + a gzipped chunk → 200 (new gzip-via-header path).
  • POST with legacy file_gzip field (no header) → 200 (backwards compat).
  • POST combining both forms → 400 {"error": "Cannot combine Content-Encoding with file_gzip field"}.
  • POST with Content-Encoding: br → 400 {"error": "Unsupported Content-Encoding"}.
  • POST with a zstd payload that would decompress to 8 MiB+1 → 400 {"error": "Chunk size too large"} without OOMing.

Tests

  • Updated test_chunk_parameters to assert compression == ["gzip", "zstd"].
  • Added 13 new tests in tests/sentry/api/endpoints/test_chunk_upload.py:
    • test_chunk_parameters_no_compression_option — kill switch zeroes the list.
    • test_upload_content_encoding_zstd — happy path, 2 chunks.
    • test_upload_content_encoding_zstd_multi_frame — concatenated zstd frames decode fully (read_across_frames=True).
    • test_upload_content_encoding_zstd_mixed_case — Content-Encoding: ZSTD accepted (RFC 7231 case-insensitive).
    • test_upload_content_encoding_gzip — gzip-via-header path.
    • test_upload_legacy_file_gzip_still_works — backwards compat.
    • test_upload_rejects_content_encoding_with_file_gzip — 400 on ambiguous request.
    • test_upload_rejects_unsupported_content_encoding — 400 on br.
    • test_upload_zstd_decompression_bomb — rejects 8 MiB+1 expansion.
    • test_upload_invalid_zstd_payload — malformed zstd → 400, not 500.
    • test_upload_invalid_gzip_payload_content_encoding — malformed gzip via header → 400.
    • test_upload_invalid_gzip_payload_legacy_field — malformed gzip via file_gzip → 400.
    • test_upload_truncated_gzip_payload — valid header, truncated stream → 400.

Negative tests also pin the exact error-body shape so a typo in the error string would be caught.

pytest --reuse-db tests/sentry/api/endpoints/test_chunk_upload.py → 30 passed. mypy + pre-commit clean.

Scope boundaries

  • No changes to src/sentry/tasks/assemble.py, src/sentry/debug_files/upload.py, or the assemble endpoints — they operate on already-decompressed FileBlobs keyed by sha1.
  • zstandard>=0.18.0 is already a direct dep in pyproject.toml and used extensively elsewhere (src/sentry/utils/codecs.py, src/sentry/attachments/base.py, src/sentry/spans/buffer.py, etc.). No new dependencies.
  • No protocol version bump needed — clients feature-detect via the compression list, matching the existing pattern.
  • Error responses use {"error": ...} to stay consistent with the rest of this file (4 pre-existing "error" sites at lines 264/276/283/304) and with the neighbouring chunk-upload surface (organization_artifactbundle_assemble.py). src/AGENTS.md now recommends {"detail": ...} for new endpoints; migrating the whole chunk-upload subsystem is out of scope for this PR.

Companion PR

CLI side: getsentry/cli#823 (merged) — teaches the new TS/Bun @loreai/cli to send zstd-compressed chunks via the Content-Encoding: zstd header when the server advertises it.

Extends the chunk-upload endpoint used by sourcemap / artifact-bundle /
DIF / preprod uploads to accept zstd-compressed chunks in addition to
the existing gzip path, advertised to clients via the `compression`
capability list in the GET response.

Compression codec is negotiated via the HTTP `Content-Encoding` request
header (`gzip` or `zstd`), which becomes the forward-looking mechanism
for both codecs. The legacy `file_gzip` multipart field continues to
work for older clients but is deprecated in favor of the header. Mixing
`Content-Encoding` with a `file_gzip` field in the same request is
rejected with 400 to avoid ambiguity.

Decompression is streamed with a hard cap of
`settings.SENTRY_CHUNK_UPLOAD_BLOB_SIZE` (8 MiB) on the decoded output
per chunk to protect against decompression bombs. The existing
`chunk-upload.no-compression` org allowlist continues to disable all
compression codecs.
@BYK BYK requested review from a team as code owners April 22, 2026 22:51
@github-actions github-actions Bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 22, 2026
BYK added a commit to getsentry/cli that referenced this pull request Apr 23, 2026
From the review pass on PR #823:

1. Keep legacy `file_gzip` field for gzip (self-hosted compat).
   The previous commit switched gzip to `Content-Encoding: gzip` on
   the standard `file` field, which would silently corrupt uploads
   against any Sentry server that predates getsentry/sentry#113760
   (the server reads the bytes as plain, SHA-1 mismatch, failure).
   Now: zstd uses `Content-Encoding: zstd` (new path), gzip uses
   the original `file_gzip` field (legacy path), plain uses `file`.

2. Use async `Bun.zstdCompress` instead of sync variant.
   The sync variant blocked the event loop for ~52ms per 8 MiB chunk;
   with concurrency=8 that defeated parallelism. Async runs off-thread.

3. Log at debug level when server advertises only unknown codecs.
   Previously the CLI silently fell back to plain, making perf
   regressions hard to diagnose.

4. Drop `UPLOAD_CODECS` export -- it was only used inside this module.
   Keep the `UploadEncoding` type export for future callers.

5. Tighten tests: assert gzip (1f 8b) and zstd (28 b5 2f fd) magic
   bytes on encoded output to catch codec mix-ups in future refactors.

Verified end-to-end against a local mock server across all four
negotiation outcomes (zstd+gzip advertised, gzip-only legacy server,
kill-switch, unknown-codec-only).
BYK added a commit to getsentry/cli that referenced this pull request Apr 23, 2026
## Summary

When the Sentry server advertises `zstd` in the chunk-upload options
response, compress chunks with zstd (Bun's built-in `Bun.zstdCompress`,
async) instead of gzip. Preference order: **zstd > gzip > plain**.

## Wire format (per codec)

| Server advertises | CLI encoding | Header | Multipart field |
|---|---|---|---|
| `[gzip, zstd]` | zstd | `Content-Encoding: zstd` | `file` |
| `[gzip]` (pre-zstd server) | gzip | *(none)* | `file_gzip` (legacy) |
| `[]` (kill-switch) | plain | *(none)* | `file` |
| unknown codecs only | plain | *(none)* | `file` |

The zstd path uses `Content-Encoding` to match the server's
forward-looking detection in getsentry/sentry#113760.

**The gzip path intentionally still uses the legacy `file_gzip`
multipart field name**, with no `Content-Encoding` header, so that this
new CLI keeps working against any Sentry server that predates #113760 —
including self-hosted deployments that haven't upgraded yet. Sending
`Content-Encoding: gzip` to such a server would cause the server to
treat gzipped bytes as plain and fail the SHA-1 check.

## Changes

- `pickUploadEncoding(compression)` — picks the most efficient codec
from the server's advertised list; logs at debug level when the server
advertises only codecs we don't implement.
- `encodeChunk(buf, encoding)` — wraps `Bun.zstdCompress` (zstd, async,
off-thread), `zlib.gzip` (gzip, async via libuv thread pool), or passes
the buffer through (plain). Never blocks the event loop.
- `uploadChunk(params)` — reads a chunk from the staging ZIP,
compresses, and POSTs with the right field name / header combination per
the table above.
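The preference order in `pickUploadEncoding` can be sketched as follows. The real implementation is TypeScript; this Python sketch only illustrates the selection logic, with `None` standing in for plain (no compression).

```python
def pick_upload_encoding(compression):
    # Preference order zstd > gzip > plain; unknown codecs fall through
    # to plain rather than failing the upload.
    for codec in ("zstd", "gzip"):
        if codec in compression:
            return codec
    return None
```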

## Compression levels

- **zstd: L3** (libzstd's default, passed explicitly). Benchmarked vs.
higher levels on a real 8 MiB sourcemap chunk: L9 shaves 14% off the
wire at 4× the compress time and forces the server's decoder to allocate
15–30 MiB of window state. L3 is Pareto-optimal on both sides.
- **gzip: L6** (zlib's default, left implicit). Counter-intuitively,
lower gzip levels decompress *slower* (L5: 37 ms vs L6: 22 ms per 8 MiB
chunk) because sparser Huffman codes are harder to decode. L9 costs
~2.4× the compress CPU for no meaningful size win. The default is right.
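Passing an explicit level, as the PR does for zstd L3, looks like this in stdlib terms. A sketch only: the timing and size figures quoted above come from the PR's own benchmark and are not reproduced here.

```python
import zlib

sample = b"function f(){return 42;}\n" * 4096
for level in (6, 9):
    # Level is an explicit argument to the compressor; output must
    # round-trip identically regardless of the level chosen.
    out = zlib.compress(sample, level)
    assert zlib.decompress(out) == sample
```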

## Verified end-to-end

Against a local mock server exercising the real `uploadSourcemaps` code
path (GET options + assemble + chunk upload + poll):

- `gzip,zstd` advertised → CLI sends `Content-Encoding: zstd`, field
`file`, zstd magic bytes `28 b5 2f fd`. ✅
- `gzip` only advertised → CLI sends **no header**, field `file_gzip`,
gzip magic `1f 8b`. (Self-hosted compat!) ✅
- `[]` kill-switch → CLI sends no header, field `file`, no compression
magic. ✅
- `brotli` only advertised → CLI logs debug, falls back to plain. ✅

Three uploads to the same mock server — all returned 200, all landed the
right bytes under the right field with the right headers.

## Tests

Added `test/lib/api/sourcemaps.test.ts` (9 tests, 20 assertions, first
test file for this module):

- `pickUploadEncoding` — preference order, fallback to `undefined`,
unknown-codec handling.
- `encodeChunk` — round-trip via `zlib.gunzipSync` /
`Bun.zstdDecompressSync` + **magic-byte assertions** (`1f 8b` for gzip,
`28 b5 2f fd` for zstd) to catch codec mix-ups in future refactors.
- `ChunkServerOptionsSchema` — accepts empty compression list and
`["gzip", "zstd"]`.

`bun run test:unit` → **5685 pass / 0 fail** (+9 new). Lint + typecheck
clean.

## Companion PR

Server side: getsentry/sentry#113760 — adds the matching
`Content-Encoding: zstd` acceptance + `compression: ["gzip", "zstd"]`
advertisement on the chunk-upload endpoint.
Three mypy errors were flagged on the initial PR:

1. `_read_bounded` accepted `IO[bytes]` but callers pass `GzipFile` and
   zstd's `stream_reader`, neither of which satisfy that protocol.
   Introduce a minimal `_Reader` protocol with just `read(n)`.
2. `get_files` yielded `IO[bytes]` but the `chunk.size` access at the
   call site requires Django's `UploadedFile`/our chunk wrappers which
   have `.size`. Introduce `_SizedChunk` protocol with `size: int` +
   `name: str` + `read(n)`.
3. The `files` buffer list typed with the new `_SizedChunk` protocol.

No runtime behavior change; all 26 tests pass, mypy clean.
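A minimal structural Protocol of the kind described in item 1 might look like this. The name and exact signature of the PR's `_Reader` may differ; this only illustrates the technique.

```python
import io
from typing import Protocol

class Reader(Protocol):
    # Structural type: anything exposing read(n) qualifies, including
    # gzip.GzipFile and zstandard's stream_reader.
    def read(self, n: int = -1) -> bytes: ...

def drain(reader: Reader) -> bytes:
    return reader.read()

assert drain(io.BytesIO(b"chunk")) == b"chunk"
```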
Code review found that the initial PR caught zstandard.ZstdError on
invalid zstd input but left gzip's analogous exceptions unhandled, so
a malformed gzip payload returned a 500 instead of a 400.

Now also catches:
  - gzip.BadGzipFile: bad magic bytes / bad header
  - EOFError: stream truncated mid-decompression

This covers BOTH gzip paths:
  - Content-Encoding: gzip on the standard `file` field (new path)
  - The legacy `file_gzip` multipart field (backwards-compat path)

Added 3 tests:
  - test_upload_invalid_gzip_payload_content_encoding
  - test_upload_invalid_gzip_payload_legacy_field
  - test_upload_truncated_gzip_payload (valid header, missing tail)

All 29 tests pass, mypy clean.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 646ddf2.

- Drop explicit type annotations that original file didn't have
  (GzipChunk.__init__, ZstdChunk.__init__, get_files, files/checksums
  locals, compression). The surrounding file is mostly untyped at these
  sites, so the annotations just added line noise.
- Drop the _Reader and _SizedChunk Protocols that only existed to satisfy
  mypy errors introduced by the annotations above. Removing the
  annotations removed the errors too; mypy stays clean.
- Drop getattr(file, "name", "") fallback in GzipChunk/ZstdChunk and
  restore direct file.name access matching the original GzipChunk. The
  file always has .name; the fallback was defensive slop.
- Drop the second reader.read(1) probe in _read_bounded; read(limit+1)
  followed by the length check is sufficient, and the extra probe was
  documented as 'shouldn't happen in practice'.
- Trim over-verbose docstrings (CHUNK_UPLOAD_COMPRESSION header comment,
  get_files docstring, post docstring additions, per-except comments).
- Inline the 'import gzip as gzip_mod' in the truncated-gzip test --
  gzip is already imported at module level.

mypy clean, 29 tests pass, pre-commit clean. Net diff now +169/-30 on
chunk.py (was +199/-30).
BYK and others added 2 commits April 23, 2026 13:04
Follow-up review caught that the single `reader.read(limit + 1)`
call in `_read_bounded` relies on readers filling the buffer in one
call. That holds for `GzipFile.read(n)` (BufferedIOBase semantics),
but `zstandard.ZstdDecompressor().stream_reader(file).read(n)` is
documented to potentially short-read when its underlying source
does -- for `SimpleUploadedFile`/`BytesIO` it doesn't in practice,
but a file-backed `TemporaryUploadedFile` could. Without a loop,
a bomb payload that happens to straddle a short-read boundary
could slip past the 8 MiB cap.

Fixed by looping until EOF or the cap is exceeded. Peak memory
stays bounded at `limit + 1` bytes.

Also pinned error-response bodies on the negative tests so a
future typo in the error string would be caught.
Cursor Bugbot caught: `zstandard.ZstdDecompressor().stream_reader(file)`
defaults to `read_across_frames=False`, so a client sending a
multi-frame zstd payload (concatenated frames) would have only the
first frame decoded -- rest silently discarded -- producing a
confusing downstream checksum mismatch instead of correct
decompression.

`GzipFile.read()` in `GzipChunk` already reads across concatenated
gzip members, so the two codecs now behave consistently. Matches
the convention established in
`src/sentry/models/eventattachment.py:179` which passes the same
flag for the same reason.

Added `test_upload_content_encoding_zstd_multi_frame` which
concatenates two independent zstd frames and asserts the decoded
payload is the concatenation of both plaintexts (30/30 tests).
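The behavior the zstd path gains with `read_across_frames=True` can be seen with stdlib gzip, whose `GzipFile.read()` already reads across concatenated members (zstandard itself is not used in this sketch):

```python
import gzip
import io

# Two independently compressed members, concatenated on the wire --
# analogous to a client sending concatenated zstd frames.
data = gzip.compress(b"frame-one") + gzip.compress(b"frame-two")
decoded = gzip.GzipFile(fileobj=io.BytesIO(data)).read()
assert decoded == b"frame-oneframe-two"  # both members decoded, nothing dropped
```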

@armenzg armenzg left a comment


The changes look good.
Feel free to ignore the logging suggestions. It's a pet peeve of mine.
My only question for you is about the option gate in the post method.

Per review feedback, use logger.warning (not info) for HTTP 400
"chunkupload.end" log lines in the paths touched by this PR so
they page/alert at the right level. Extends to the 2 pre-existing
sites that now live inside the new try/except block for
consistency. The 2 pre-existing sites OUTSIDE this PR's surface
(too-many-chunks, from_files OSError) are left on logger.info
unchanged to keep the diff focused.
@BYK BYK merged commit d4814b0 into master Apr 24, 2026
56 checks passed
@BYK BYK deleted the byk/chunk-upload-zstd branch April 24, 2026 09:37