feat(chunk-upload): Add zstd compression support for sourcemap uploads #113760
Merged
Conversation
Extends the chunk-upload endpoint used by sourcemap / artifact-bundle / DIF / preprod uploads to accept zstd-compressed chunks in addition to the existing gzip path, advertised to clients via the `compression` capability list in the GET response.

The compression codec is negotiated via the HTTP `Content-Encoding` request header (`gzip` or `zstd`), which becomes the forward-looking mechanism for both codecs. The legacy `file_gzip` multipart field continues to work for older clients but is deprecated in favor of the header. Mixing `Content-Encoding` with a `file_gzip` field in the same request is rejected with 400 to avoid ambiguity.

Decompression is streamed with a hard cap of `settings.SENTRY_CHUNK_UPLOAD_BLOB_SIZE` (8 MiB) on the decoded output per chunk to protect against decompression bombs. The existing `chunk-upload.no-compression` org allowlist continues to disable all compression codecs.
BYK added a commit to getsentry/cli that referenced this pull request (Apr 23, 2026):
From the review pass on PR #823:

1. Keep the legacy `file_gzip` field for gzip (self-hosted compat). The previous commit switched gzip to `Content-Encoding: gzip` on the standard `file` field, which would silently corrupt uploads against any Sentry server that predates getsentry/sentry#113760 (the server reads the bytes as plain, SHA-1 mismatch, failure). Now: zstd uses `Content-Encoding: zstd` (new path), gzip uses the original `file_gzip` field (legacy path), plain uses `file`.
2. Use async `Bun.zstdCompress` instead of the sync variant. The sync variant blocked the event loop for ~52 ms per 8 MiB chunk; with concurrency=8 that defeated parallelism. Async runs off-thread.
3. Log at debug level when the server advertises only unknown codecs. Previously the CLI silently fell back to plain, making perf regressions hard to diagnose.
4. Drop the `UPLOAD_CODECS` export -- it was only used inside this module. Keep the `UploadEncoding` type export for future callers.
5. Tighten tests: assert gzip (`1f 8b`) and zstd (`28 b5 2f fd`) magic bytes on encoded output to catch codec mix-ups in future refactors.

Verified end-to-end against a local mock server across all four negotiation outcomes (zstd+gzip advertised, gzip-only legacy server, kill-switch, unknown-codec-only).
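The magic-byte checks in item 5 translate directly to the stdlib; a Python sketch of the same idea (the CLI's actual assertions live in its TypeScript tests):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"            # RFC 1952: every gzip stream starts with these
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"    # RFC 8878: little-endian zstd frame magic

# Checking the first bytes of the encoded output is a cheap guard against
# codec mix-ups in future refactors.
encoded = gzip.compress(b"example chunk")
assert encoded[:2] == GZIP_MAGIC
```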
BYK added a commit to getsentry/cli that referenced this pull request (Apr 23, 2026):
## Summary

When the Sentry server advertises `zstd` in the chunk-upload options response, compress chunks with zstd (Bun's built-in `Bun.zstdCompress`, async) instead of gzip. Preference order: **zstd > gzip > plain**.

## Wire format (per codec)

| Server advertises | CLI encoding | Header | Multipart field |
|---|---|---|---|
| `[gzip, zstd]` | zstd | `Content-Encoding: zstd` | `file` |
| `[gzip]` (pre-zstd server) | gzip | *(none)* | `file_gzip` (legacy) |
| `[]` (kill-switch) | plain | *(none)* | `file` |
| unknown codecs only | plain | *(none)* | `file` |

The zstd path uses `Content-Encoding` to match the server's forward-looking detection in getsentry/sentry#113760. **The gzip path intentionally still uses the legacy `file_gzip` multipart field name**, with no `Content-Encoding` header, so that this new CLI keeps working against any Sentry server that predates #113760 — including self-hosted deployments that haven't upgraded yet. Sending `Content-Encoding: gzip` to such a server would cause the server to treat gzipped bytes as plain and fail the SHA-1 check.

## Changes

- `pickUploadEncoding(compression)` — picks the most efficient codec from the server's advertised list; logs at debug level when the server advertises only codecs we don't implement.
- `encodeChunk(buf, encoding)` — wraps `Bun.zstdCompress` (zstd, async, off-thread), `zlib.gzip` (gzip, async via libuv thread pool), or passes the buffer through (plain). Never blocks the event loop.
- `uploadChunk(params)` — reads a chunk from the staging ZIP, compresses, and POSTs with the right field name / header combination per the table above.

## Compression levels

- **zstd: L3** (libzstd's default, passed explicitly). Benchmarked vs. higher levels on a real 8 MiB sourcemap chunk: L9 shaves 14% off the wire at 4× the compress time and forces the server's decoder to allocate 15–30 MiB of window state. L3 is Pareto-optimal on both sides.
- **gzip: L6** (zlib's default, left implicit). Counter-intuitively, lower gzip levels decompress *slower* (L5: 37 ms vs L6: 22 ms per 8 MiB chunk) because sparser Huffman codes are harder to decode. L9 costs ~2.4× the compress CPU for no meaningful size win. The default is right.

## Verified end-to-end

Against a local mock server exercising the real `uploadSourcemaps` code path (GET options + assemble + chunk upload + poll):

- `gzip,zstd` advertised → CLI sends `Content-Encoding: zstd`, field `file`, zstd magic bytes `28 b5 2f fd`. ✅
- `gzip` only advertised → CLI sends **no header**, field `file_gzip`, gzip magic `1f 8b`. (Self-hosted compat!) ✅
- `[]` kill-switch → CLI sends no header, field `file`, no compression magic. ✅
- `brotli` only advertised → CLI logs debug, falls back to plain. ✅

Three uploads to the same mock server — all returned 200, all landed the right bytes under the right field with the right headers.

## Tests

Added `test/lib/api/sourcemaps.test.ts` (9 tests, 20 assertions, first test file for this module):

- `pickUploadEncoding` — preference order, fallback to `undefined`, unknown-codec handling.
- `encodeChunk` — round-trip via `zlib.gunzipSync` / `Bun.zstdDecompressSync` + **magic-byte assertions** (`1f 8b` for gzip, `28 b5 2f fd` for zstd) to catch codec mix-ups in future refactors.
- `ChunkServerOptionsSchema` — accepts empty compression list and `["gzip", "zstd"]`.

`bun run test:unit` → **5685 pass / 0 fail** (+9 new). Lint + typecheck clean.

## Companion PR

Server side: getsentry/sentry#113760 — adds the matching `Content-Encoding: zstd` acceptance + `compression: ["gzip", "zstd"]` advertisement on the chunk-upload endpoint.
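The level trade-off in the gzip bullet is easy to poke at with the stdlib. A small sketch (sizes and timings vary by payload; the 37 ms / 22 ms numbers above are the commit author's benchmarks, not reproduced here):

```python
import zlib

# A repetitive, sourcemap-like payload (~160 KiB).
data = b'{"version":3,"sources":["app.ts"],"mappings":"AAAA;"}\n' * 3000

l6 = zlib.compress(data, 6)  # zlib's default level
l9 = zlib.compress(data, 9)  # maximum compression

# The level only trades CPU for size; both streams round-trip identically,
# which is why leaving the default implicit is safe.
assert zlib.decompress(l6) == data
assert zlib.decompress(l9) == data
```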
Three mypy errors were flagged on the initial PR:

1. `_read_bounded` accepted `IO[bytes]`, but callers pass `GzipFile` and zstd's `stream_reader`, neither of which satisfies that protocol. Introduce a minimal `_Reader` protocol with just `read(n)`.
2. `get_files` yielded `IO[bytes]`, but the `chunk.size` access at the call site requires Django's `UploadedFile` / our chunk wrappers, which have `.size`. Introduce a `_SizedChunk` protocol with `size: int` + `name: str` + `read(n)`.
3. Type the `files` buffer list with the new `_SizedChunk` protocol.

No runtime behavior change; all 26 tests pass, mypy clean.
Code review caught that the initial PR caught `zstandard.ZstdError` on invalid zstd input but left gzip's analogous exceptions unhandled, so a malformed gzip payload returned a 500 instead of a 400. Now also catches:

- `gzip.BadGzipFile`: bad magic bytes / bad header
- `EOFError`: stream truncated mid-decompression

This covers BOTH gzip paths:

- `Content-Encoding: gzip` on the standard `file` field (new path)
- The legacy `file_gzip` multipart field (backwards-compat path)

Added 3 tests:

- `test_upload_invalid_gzip_payload_content_encoding`
- `test_upload_invalid_gzip_payload_legacy_field`
- `test_upload_truncated_gzip_payload` (valid header, missing tail)

All 29 tests pass, mypy clean.
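A stdlib-only sketch of the gzip half of that mapping (the helper name and return shape are illustrative; the real endpoint returns DRF 400 responses rather than dicts):

```python
import gzip
import io


def decode_gzip_chunk(raw: bytes, limit: int):
    """Decode a gzip chunk, converting malformed input to a 400-style error."""
    try:
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read(limit)
    except (gzip.BadGzipFile, EOFError):
        # BadGzipFile: bad magic bytes / header; EOFError: truncated stream.
        return {"error": "Invalid gzip payload"}


assert decode_gzip_chunk(gzip.compress(b"ok"), 1024) == b"ok"
assert decode_gzip_chunk(b"not a gzip stream", 1024) == {"error": "Invalid gzip payload"}
```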
Contributor
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 646ddf2.
- Drop explicit type annotations that the original file didn't have (`GzipChunk.__init__`, `ZstdChunk.__init__`, `get_files`, `files`/`checksums` locals, `compression`). The surrounding file is mostly untyped at these sites, so the annotations just added line noise.
- Drop the `_Reader` and `_SizedChunk` Protocols that only existed to satisfy mypy errors introduced by the annotations above. Removing the annotations removed the errors too; mypy stays clean.
- Drop the `getattr(file, "name", "")` fallback in `GzipChunk`/`ZstdChunk` and restore direct `file.name` access matching the original `GzipChunk`. The file always has `.name`; the fallback was defensive slop.
- Drop the second `reader.read(1)` probe in `_read_bounded`; `read(limit + 1)` followed by the length check is sufficient, and the extra probe was documented as "shouldn't happen in practice".
- Trim over-verbose docstrings (`CHUNK_UPLOAD_COMPRESSION` header comment, `get_files` docstring, `post` docstring additions, per-except comments).
- Inline the `import gzip as gzip_mod` in the truncated-gzip test -- gzip is already imported at module level.

mypy clean, 29 tests pass, pre-commit clean. Net diff now +169/-30 on chunk.py (was +199/-30).
Follow-up review caught that the single `reader.read(limit + 1)` call in `_read_bounded` relies on readers filling the buffer in one call. That holds for `GzipFile.read(n)` (BufferedIOBase semantics), but `zstandard.ZstdDecompressor().stream_reader(file).read(n)` is documented to potentially short-read when its underlying source does -- for `SimpleUploadedFile`/`BytesIO` it doesn't in practice, but a file-backed `TemporaryUploadedFile` could. Without a loop, a bomb payload that happens to straddle a short-read boundary could slip past the 8 MiB cap. Fixed by looping until EOF or the cap is exceeded. Peak memory stays bounded at `limit + 1` bytes. Also pinned error-response bodies on the negative tests so a future typo in the error string would be caught.
Cursor Bugbot caught: `zstandard.ZstdDecompressor().stream_reader(file)` defaults to `read_across_frames=False`, so a client sending a multi-frame zstd payload (concatenated frames) would have only the first frame decoded -- rest silently discarded -- producing a confusing downstream checksum mismatch instead of correct decompression. `GzipFile.read()` in `GzipChunk` already reads across concatenated gzip members, so the two codecs now behave consistently. Matches the convention established in `src/sentry/models/eventattachment.py:179` which passes the same flag for the same reason. Added `test_upload_content_encoding_zstd_multi_frame` which concatenates two independent zstd frames and asserts the decoded payload is the concatenation of both plaintexts (30/30 tests).
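A stdlib demonstration of the gzip half of that consistency claim: `GzipFile.read()` already reads across concatenated members (RFC 1952 multi-member files), which is exactly the behavior `read_across_frames=True` restores for zstd.

```python
import gzip
import io

# Two independently-compressed gzip members, concatenated on the wire.
wire = gzip.compress(b"first frame ") + gzip.compress(b"second frame")

# GzipFile decodes both members, so the output is the concatenation of
# both plaintexts -- not just the first member silently truncated.
decoded = gzip.GzipFile(fileobj=io.BytesIO(wire)).read()
assert decoded == b"first frame second frame"
```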
armenzg approved these changes (Apr 23, 2026)
armenzg (Member) left a comment:
The changes look good.
Feel free to ignore the logging suggestions. It's a pet-peeve of mine.
My only question for you is the option gate in the post method.
Per review feedback, use `logger.warning` (not `info`) for the HTTP 400 "chunkupload.end" log lines in the paths touched by this PR so they page/alert at the right level. This also covers the 2 pre-existing sites that now live inside the new try/except block, for consistency. The 2 pre-existing sites OUTSIDE this PR's surface (too-many-chunks, `from_files` OSError) are left on `logger.info` unchanged to keep the diff focused.
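As a sketch of the distinction (logger name, message, and `extra` keys are illustrative, not the endpoint's actual log call):

```python
import logging

logger = logging.getLogger("sentry.api.chunkupload")  # illustrative name


def log_rejection(reason: str) -> None:
    # 400 rejections on the new decompression paths log at WARNING so they
    # surface in alerting; unrelated pre-existing sites stay at INFO.
    logger.warning("chunkupload.end", extra={"status": 400, "reason": reason})
```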

## Summary

Extends the chunk-upload endpoint used by sourcemap / artifact-bundle / DIF / preprod uploads to accept zstd-compressed chunks in addition to the existing gzip path, advertised to clients via the `compression` capability list in the GET response.

## Protocol changes

- New `Content-Encoding` detection path. When `Content-Encoding: gzip` or `Content-Encoding: zstd` is set on a chunk upload POST, every chunk delivered under the standard `file` multipart field is decoded using that codec.
- The legacy `file_gzip` field still works for backwards compatibility with older `sentry-cli` / `@sentry/cli` versions that only know how to send chunks under that field name.
- Combining both mechanisms is rejected with 400 (`Cannot combine Content-Encoding with file_gzip field`). A single request cannot use both.
- Unsupported codecs are rejected with 400 (`Unsupported Content-Encoding`).
- The GET response's `compression` list is extended to `["gzip", "zstd"]`. No `CHUNK_UPLOAD_ACCEPT` entry is added — clients have always feature-detected compression via this list. The existing `chunk-upload.no-compression` org allowlist kill-switch continues to disable all compression codecs.

## Security / safety

Decompression is streamed through `zstandard.ZstdDecompressor().stream_reader(file)` (and `GzipFile` for gzip) via a bounded-read helper that loops until EOF or a hard cap of `settings.SENTRY_CHUNK_UPLOAD_BLOB_SIZE` (8 MiB) on decoded output per chunk. The loop handles short reads from streaming decoders (important for `ZstdDecompressor.stream_reader` + a file-backed `TemporaryUploadedFile`). Excess bytes raise `ChunkTooLarge` → 400. Peak memory stays bounded at `limit + 1` bytes regardless of how well-compressed the input is, protecting against decompression-bomb inputs where a tiny compressed payload would expand to hundreds of megabytes.

Invalid / malformed payloads are all converted to 400 (never 500):

- `zstandard.ZstdError` → `Invalid zstd payload`
- `gzip.BadGzipFile` (bad magic / header) → `Invalid gzip payload`
- `EOFError` (truncated stream) → `Invalid gzip payload`

## End-to-end verified

Ran the full request pipeline (middleware + auth + DRF + endpoint) against the live getsentry devservices stack on a fresh Coder VM and confirmed:

- GET advertises `"compression": ["gzip", "zstd"]`.
- `Content-Encoding: zstd` + a zstd-compressed chunk → 200 + `FileBlob` persisted.
- `Content-Encoding: gzip` + a gzipped chunk → 200 (new gzip-via-header path).
- Legacy `file_gzip` field (no header) → 200 (backwards compat).
- Mixing both → 400 `{"error": "Cannot combine Content-Encoding with file_gzip field"}`.
- `Content-Encoding: br` → 400 `{"error": "Unsupported Content-Encoding"}`.
- A decompression bomb → 400 `{"error": "Chunk size too large"}` without OOMing.

## Tests

Extended `test_chunk_parameters` to assert `compression == ["gzip", "zstd"]`. New tests in `tests/sentry/api/endpoints/test_chunk_upload.py`:

- `test_chunk_parameters_no_compression_option` — kill switch zeroes the list.
- `test_upload_content_encoding_zstd` — happy path, 2 chunks.
- `test_upload_content_encoding_zstd_multi_frame` — concatenated zstd frames decode fully (`read_across_frames=True`).
- `test_upload_content_encoding_zstd_mixed_case` — `Content-Encoding: ZSTD` accepted (RFC 7231 case-insensitive).
- `test_upload_content_encoding_gzip` — gzip-via-header path.
- `test_upload_legacy_file_gzip_still_works` — backwards compat.
- `test_upload_rejects_content_encoding_with_file_gzip` — 400 on ambiguous request.
- `test_upload_rejects_unsupported_content_encoding` — 400 on `br`.
- `test_upload_zstd_decompression_bomb` — rejects 8 MiB + 1 expansion.
- `test_upload_invalid_zstd_payload` — malformed zstd → 400, not 500.
- `test_upload_invalid_gzip_payload_content_encoding` — malformed gzip via header → 400.
- `test_upload_invalid_gzip_payload_legacy_field` — malformed gzip via `file_gzip` → 400.
- `test_upload_truncated_gzip_payload` — valid header, truncated stream → 400.

Negative tests also pin the exact error-body shape so a typo in the error string would be caught.

`pytest --reuse-db tests/sentry/api/endpoints/test_chunk_upload.py` → 30 passed. mypy + pre-commit clean.

## Scope boundaries

- No changes to `src/sentry/tasks/assemble.py`, `src/sentry/debug_files/upload.py`, or the assemble endpoints — they operate on already-decompressed `FileBlob`s keyed by sha1.
- `zstandard>=0.18.0` is already a direct dep in `pyproject.toml` and used extensively elsewhere (`src/sentry/utils/codecs.py`, `src/sentry/attachments/base.py`, `src/sentry/spans/buffer.py`, etc.). No new dependencies.
- No new option is added; zstd is gated by the `compression` list, matching the existing pattern.
- Error responses use `{"error": ...}` to stay consistent with the rest of this file (4 pre-existing `"error"` sites at lines 264/276/283/304) and with the neighbouring chunk-upload surface (`organization_artifactbundle_assemble.py`). `src/AGENTS.md` now recommends `{"detail": ...}` for new endpoints; migrating the whole chunk-upload subsystem is out of scope for this PR.

## Companion PR

CLI side: getsentry/cli#823 (merged) — teaches the new TS/Bun `@loreai/cli` to send zstd-compressed chunks via the `Content-Encoding: zstd` header when the server advertises it.