
feat(chunk-upload): Add zstd compression support for sourcemap uploads#113760

Merged
BYK merged 8 commits into master from byk/chunk-upload-zstd
Apr 24, 2026

Conversation


@BYK BYK commented Apr 22, 2026

Summary

Extends the chunk-upload endpoint used by sourcemap / artifact-bundle / DIF / preprod uploads to accept zstd-compressed chunks in addition to the existing gzip path, advertised to clients via the compression capability list in the GET response.

Protocol changes

  • New Content-Encoding detection path. When Content-Encoding: gzip or Content-Encoding: zstd is set on a chunk upload POST, every chunk delivered under the standard file multipart field is decoded using that codec.
  • Legacy file_gzip field still works for backwards compatibility with older sentry-cli / @sentry/cli versions that only know how to send chunks under that field name.
  • Ambiguous requests are rejected with HTTP 400 (Cannot combine Content-Encoding with file_gzip field). A single request cannot use both mechanisms.
  • Unsupported encodings return HTTP 400 (Unsupported Content-Encoding).
  • compression list extended to ["gzip", "zstd"]. No CHUNK_UPLOAD_ACCEPT entry is added — clients have always feature-detected compression via this list. The existing chunk-upload.no-compression org allowlist kill-switch continues to disable all compression codecs.
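The negotiation rules above can be sketched as a small decision function. This is an illustrative Python sketch, not Sentry's actual code: the function name and return shape are hypothetical, and the real endpoint returns DRF responses rather than tuples.

```python
def resolve_chunk_codec(content_encoding, has_file_gzip):
    # Hypothetical sketch of the negotiation rules described above.
    if content_encoding is not None and has_file_gzip:
        return ("error", "Cannot combine Content-Encoding with file_gzip field")
    if content_encoding is None:
        # No header: the legacy file_gzip field implies gzip, else plain bytes
        return ("ok", "gzip" if has_file_gzip else None)
    codec = content_encoding.strip().lower()  # header values are case-insensitive
    if codec in ("gzip", "zstd"):
        return ("ok", codec)
    return ("error", "Unsupported Content-Encoding")
```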

Security / safety

Decompression is streamed through zstandard.ZstdDecompressor().stream_reader(file) (and GzipFile for gzip) via a bounded-read helper that loops until EOF or a hard cap of settings.SENTRY_CHUNK_UPLOAD_BLOB_SIZE (8 MiB) on decoded output per chunk. The loop handles short reads from streaming decoders (important for ZstdDecompressor.stream_reader + file-backed TemporaryUploadedFile). Excess bytes raise ChunkTooLarge → 400. Peak memory stays bounded at limit + 1 bytes regardless of how well-compressed the input is, protecting against decompression-bomb inputs where a tiny compressed payload would expand to hundreds of megabytes.
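The bounded-read loop can be sketched roughly as follows. This is a minimal illustration using stdlib gzip only (the PR's helper also feeds zstd's stream_reader); names are hypothetical and the cap here is a test value, not SENTRY_CHUNK_UPLOAD_BLOB_SIZE.

```python
import gzip
import io

class ChunkTooLarge(Exception):
    """Decoded output exceeded the per-chunk cap."""

def read_bounded(reader, limit: int) -> bytes:
    # Loop until EOF or the cap is exceeded: streaming decoders
    # (e.g. zstd's stream_reader over a file-backed source) may
    # return fewer bytes than requested on any single read().
    buf = bytearray()
    while len(buf) <= limit:
        chunk = reader.read(limit + 1 - len(buf))
        if not chunk:
            return bytes(buf)  # EOF within the cap
        buf += chunk
    raise ChunkTooLarge  # peak memory stays at limit + 1 bytes

raw = b"a" * 1000
compressed = gzip.compress(raw)
assert read_bounded(gzip.GzipFile(fileobj=io.BytesIO(compressed)), 1000) == raw
```

Because the loop never requests more than `limit + 1 - len(buf)` bytes, a tiny compressed bomb cannot force a large allocation before the cap check fires.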

Invalid / malformed payloads are all converted to 400 (never 500):

  • zstandard.ZstdError → Invalid zstd payload
  • gzip.BadGzipFile (bad magic / header) → Invalid gzip payload
  • EOFError (truncated stream) → Invalid gzip payload
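The gzip half of this mapping can be demonstrated with the stdlib alone. A hypothetical sketch (the real endpoint returns DRF 400 responses, not dicts, and also catches zstandard.ZstdError):

```python
import gzip
import io

def decode_gzip_chunk(raw: bytes):
    # Map decoder failures to 400-style error bodies instead of letting
    # them propagate as a 500.
    try:
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    except gzip.BadGzipFile:  # bad magic bytes / bad header
        return {"error": "Invalid gzip payload"}
    except EOFError:  # stream truncated mid-decompression
        return {"error": "Invalid gzip payload"}

assert decode_gzip_chunk(gzip.compress(b"hello")) == b"hello"
```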

End-to-end verified

Ran the full request pipeline (middleware + auth + DRF + endpoint) against the live getsentry devservices stack on a fresh Coder VM and confirmed:

  • GET options endpoint advertises "compression": ["gzip", "zstd"].
  • POST with Content-Encoding: zstd + a zstd-compressed chunk → 200 + FileBlob persisted.
  • POST with Content-Encoding: gzip + a gzipped chunk → 200 (new gzip-via-header path).
  • POST with legacy file_gzip field (no header) → 200 (backwards compat).
  • POST combining both forms → 400 {"error": "Cannot combine Content-Encoding with file_gzip field"}.
  • POST with Content-Encoding: br → 400 {"error": "Unsupported Content-Encoding"}.
  • POST with a zstd payload that would decompress to 8 MiB+1 → 400 {"error": "Chunk size too large"} without OOMing.

Tests

  • Updated test_chunk_parameters to assert compression == ["gzip", "zstd"].
  • Added 13 new tests in tests/sentry/api/endpoints/test_chunk_upload.py:
    • test_chunk_parameters_no_compression_option — kill switch zeroes the list.
    • test_upload_content_encoding_zstd — happy path, 2 chunks.
    • test_upload_content_encoding_zstd_multi_frame — concatenated zstd frames decode fully (read_across_frames=True).
    • test_upload_content_encoding_zstd_mixed_case — Content-Encoding: ZSTD accepted (RFC 7231 case-insensitive).
    • test_upload_content_encoding_gzip — gzip-via-header path.
    • test_upload_legacy_file_gzip_still_works — backwards compat.
    • test_upload_rejects_content_encoding_with_file_gzip — 400 on ambiguous request.
    • test_upload_rejects_unsupported_content_encoding — 400 on br.
    • test_upload_zstd_decompression_bomb — rejects 8 MiB+1 expansion.
    • test_upload_invalid_zstd_payload — malformed zstd → 400, not 500.
    • test_upload_invalid_gzip_payload_content_encoding — malformed gzip via header → 400.
    • test_upload_invalid_gzip_payload_legacy_field — malformed gzip via file_gzip → 400.
    • test_upload_truncated_gzip_payload — valid header, truncated stream → 400.

Negative tests also pin the exact error-body shape so a typo in the error string would be caught.

pytest --reuse-db tests/sentry/api/endpoints/test_chunk_upload.py → 30 passed. mypy + pre-commit clean.

Scope boundaries

  • No changes to src/sentry/tasks/assemble.py, src/sentry/debug_files/upload.py, or the assemble endpoints — they operate on already-decompressed FileBlobs keyed by sha1.
  • zstandard>=0.18.0 is already a direct dep in pyproject.toml and used extensively elsewhere (src/sentry/utils/codecs.py, src/sentry/attachments/base.py, src/sentry/spans/buffer.py, etc.). No new dependencies.
  • No protocol version bump needed — clients feature-detect via the compression list, matching the existing pattern.
  • Error responses use {"error": ...} to stay consistent with the rest of this file (4 pre-existing "error" sites at lines 264/276/283/304) and with the neighbouring chunk-upload surface (organization_artifactbundle_assemble.py). src/AGENTS.md now recommends {"detail": ...} for new endpoints; migrating the whole chunk-upload subsystem is out of scope for this PR.

Companion PR

CLI side: getsentry/cli#823 (merged) — teaches the new TS/Bun @loreai/cli to send zstd-compressed chunks via the Content-Encoding: zstd header when the server advertises it.

Extends the chunk-upload endpoint used by sourcemap / artifact-bundle /
DIF / preprod uploads to accept zstd-compressed chunks in addition to
the existing gzip path, advertised to clients via the `compression`
capability list in the GET response.

Compression codec is negotiated via the HTTP `Content-Encoding` request
header (`gzip` or `zstd`), which becomes the forward-looking mechanism
for both codecs. The legacy `file_gzip` multipart field continues to
work for older clients but is deprecated in favor of the header. Mixing
`Content-Encoding` with a `file_gzip` field in the same request is
rejected with 400 to avoid ambiguity.

Decompression is streamed with a hard cap of
`settings.SENTRY_CHUNK_UPLOAD_BLOB_SIZE` (8 MiB) on the decoded output
per chunk to protect against decompression bombs. The existing
`chunk-upload.no-compression` org allowlist continues to disable all
compression codecs.
@BYK BYK requested review from a team as code owners April 22, 2026 22:51
@github-actions github-actions Bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 22, 2026
BYK added a commit to getsentry/cli that referenced this pull request Apr 23, 2026
From the review pass on PR #823:

1. Keep legacy `file_gzip` field for gzip (self-hosted compat).
   The previous commit switched gzip to `Content-Encoding: gzip` on
   the standard `file` field, which would silently corrupt uploads
   against any Sentry server that predates getsentry/sentry#113760
   (the server reads the bytes as plain, SHA-1 mismatch, failure).
   Now: zstd uses `Content-Encoding: zstd` (new path), gzip uses
   the original `file_gzip` field (legacy path), plain uses `file`.

2. Use async `Bun.zstdCompress` instead of sync variant.
   The sync variant blocked the event loop for ~52ms per 8 MiB chunk;
   with concurrency=8 that defeated parallelism. Async runs off-thread.

3. Log at debug level when server advertises only unknown codecs.
   Previously the CLI silently fell back to plain, making perf
   regressions hard to diagnose.

4. Drop `UPLOAD_CODECS` export -- it was only used inside this module.
   Keep the `UploadEncoding` type export for future callers.

5. Tighten tests: assert gzip (1f 8b) and zstd (28 b5 2f fd) magic
   bytes on encoded output to catch codec mix-ups in future refactors.

Verified end-to-end against a local mock server across all four
negotiation outcomes (zstd+gzip advertised, gzip-only legacy server,
kill-switch, unknown-codec-only).
BYK added a commit to getsentry/cli that referenced this pull request Apr 23, 2026
## Summary

When the Sentry server advertises `zstd` in the chunk-upload options
response, compress chunks with zstd (Bun's built-in `Bun.zstdCompress`,
async) instead of gzip. Preference order: **zstd > gzip > plain**.

## Wire format (per codec)

| Server advertises | CLI encoding | Header | Multipart field |
|---|---|---|---|
| `[gzip, zstd]` | zstd | `Content-Encoding: zstd` | `file` |
| `[gzip]` (pre-zstd server) | gzip | *(none)* | `file_gzip` (legacy) |
| `[]` (kill-switch) | plain | *(none)* | `file` |
| unknown codecs only | plain | *(none)* | `file` |

The zstd path uses `Content-Encoding` to match the server's
forward-looking detection in getsentry/sentry#113760.

**The gzip path intentionally still uses the legacy `file_gzip`
multipart field name**, with no `Content-Encoding` header, so that this
new CLI keeps working against any Sentry server that predates #113760 —
including self-hosted deployments that haven't upgraded yet. Sending
`Content-Encoding: gzip` to such a server would cause the server to
treat gzipped bytes as plain and fail the SHA-1 check.

## Changes

- `pickUploadEncoding(compression)` — picks the most efficient codec
from the server's advertised list; logs at debug level when the server
advertises only codecs we don't implement.
- `encodeChunk(buf, encoding)` — wraps `Bun.zstdCompress` (zstd, async,
off-thread), `zlib.gzip` (gzip, async via libuv thread pool), or passes
the buffer through (plain). Never blocks the event loop.
- `uploadChunk(params)` — reads a chunk from the staging ZIP,
compresses, and POSTs with the right field name / header combination per
the table above.
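The preference order in `pickUploadEncoding` can be sketched as follows. The real implementation is TypeScript; this Python sketch only illustrates the selection logic, with `None` standing in for plain (no compression).

```python
def pick_upload_encoding(compression):
    # Preference order zstd > gzip > plain; unknown codecs fall through
    # to plain rather than failing the upload.
    for codec in ("zstd", "gzip"):
        if codec in compression:
            return codec
    return None
```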

## Compression levels

- **zstd: L3** (libzstd's default, passed explicitly). Benchmarked vs.
higher levels on a real 8 MiB sourcemap chunk: L9 shaves 14% off the
wire at 4× the compress time and forces the server's decoder to allocate
15–30 MiB of window state. L3 is Pareto-optimal on both sides.
- **gzip: L6** (zlib's default, left implicit). Counter-intuitively,
lower gzip levels decompress *slower* (L5: 37 ms vs L6: 22 ms per 8 MiB
chunk) because sparser Huffman codes are harder to decode. L9 costs
~2.4× the compress CPU for no meaningful size win. The default is right.
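Passing an explicit level, as the PR does for zstd L3, looks like this in stdlib terms. A sketch only: the timing and size figures quoted above come from the PR's own benchmark and are not reproduced here.

```python
import zlib

sample = b"function f(){return 42;}\n" * 4096
for level in (6, 9):
    # Level is an explicit argument to the compressor; output must
    # round-trip identically regardless of the level chosen.
    out = zlib.compress(sample, level)
    assert zlib.decompress(out) == sample
```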

## Verified end-to-end

Against a local mock server exercising the real `uploadSourcemaps` code
path (GET options + assemble + chunk upload + poll):

- `gzip,zstd` advertised → CLI sends `Content-Encoding: zstd`, field
`file`, zstd magic bytes `28 b5 2f fd`. ✅
- `gzip` only advertised → CLI sends **no header**, field `file_gzip`,
gzip magic `1f 8b`. (Self-hosted compat!) ✅
- `[]` kill-switch → CLI sends no header, field `file`, no compression
magic. ✅
- `brotli` only advertised → CLI logs debug, falls back to plain. ✅

Three uploads to the same mock server — all returned 200, all landed the
right bytes under the right field with the right headers.

## Tests

Added `test/lib/api/sourcemaps.test.ts` (9 tests, 20 assertions, first
test file for this module):

- `pickUploadEncoding` — preference order, fallback to `undefined`,
unknown-codec handling.
- `encodeChunk` — round-trip via `zlib.gunzipSync` /
`Bun.zstdDecompressSync` + **magic-byte assertions** (`1f 8b` for gzip,
`28 b5 2f fd` for zstd) to catch codec mix-ups in future refactors.
- `ChunkServerOptionsSchema` — accepts empty compression list and
`["gzip", "zstd"]`.

`bun run test:unit` → **5685 pass / 0 fail** (+9 new). Lint + typecheck
clean.

## Companion PR

Server side: getsentry/sentry#113760 — adds the matching
`Content-Encoding: zstd` acceptance + `compression: ["gzip", "zstd"]`
advertisement on the chunk-upload endpoint.
Three mypy errors were flagged on the initial PR:

1. `_read_bounded` accepted `IO[bytes]` but callers pass `GzipFile` and
   zstd's `stream_reader`, neither of which satisfy that protocol.
   Introduce a minimal `_Reader` protocol with just `read(n)`.
2. `get_files` yielded `IO[bytes]` but the `chunk.size` access at the
   call site requires Django's `UploadedFile`/our chunk wrappers which
   have `.size`. Introduce `_SizedChunk` protocol with `size: int` +
   `name: str` + `read(n)`.
3. The `files` buffer list typed with the new `_SizedChunk` protocol.

No runtime behavior change; all 26 tests pass, mypy clean.
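A minimal structural Protocol of the kind described in item 1 might look like this. The name and exact signature of the PR's `_Reader` may differ; this only illustrates the technique.

```python
import io
from typing import Protocol

class Reader(Protocol):
    # Structural type: anything exposing read(n) qualifies, including
    # gzip.GzipFile and zstandard's stream_reader.
    def read(self, n: int = -1) -> bytes: ...

def drain(reader: Reader) -> bytes:
    return reader.read()

assert drain(io.BytesIO(b"chunk")) == b"chunk"
```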
Code review found that the initial PR caught zstandard.ZstdError on
invalid zstd input but left gzip's analogous exceptions unhandled, so
a malformed gzip payload returned a 500 instead of a 400.

Now also catches:
  - gzip.BadGzipFile: bad magic bytes / bad header
  - EOFError: stream truncated mid-decompression

This covers BOTH gzip paths:
  - Content-Encoding: gzip on the standard `file` field (new path)
  - The legacy `file_gzip` multipart field (backwards-compat path)

Added 3 tests:
  - test_upload_invalid_gzip_payload_content_encoding
  - test_upload_invalid_gzip_payload_legacy_field
  - test_upload_truncated_gzip_payload (valid header, missing tail)

All 29 tests pass, mypy clean.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 646ddf2.

- Drop explicit type annotations that original file didn't have
  (GzipChunk.__init__, ZstdChunk.__init__, get_files, files/checksums
  locals, compression). The surrounding file is mostly untyped at these
  sites, so the annotations just added line noise.
- Drop the _Reader and _SizedChunk Protocols that only existed to satisfy
  mypy errors introduced by the annotations above. Removing the
  annotations removed the errors too; mypy stays clean.
- Drop getattr(file, "name", "") fallback in GzipChunk/ZstdChunk and
  restore direct file.name access matching the original GzipChunk. The
  file always has .name; the fallback was defensive slop.
- Drop the second reader.read(1) probe in _read_bounded; read(limit+1)
  followed by the length check is sufficient, and the extra probe was
  documented as 'shouldn't happen in practice'.
- Trim over-verbose docstrings (CHUNK_UPLOAD_COMPRESSION header comment,
  get_files docstring, post docstring additions, per-except comments).
- Inline the 'import gzip as gzip_mod' in the truncated-gzip test --
  gzip is already imported at module level.

mypy clean, 29 tests pass, pre-commit clean. Net diff now +169/-30 on
chunk.py (was +199/-30).
BYK and others added 2 commits April 23, 2026 13:04
Follow-up review caught that the single `reader.read(limit + 1)`
call in `_read_bounded` relies on readers filling the buffer in one
call. That holds for `GzipFile.read(n)` (BufferedIOBase semantics),
but `zstandard.ZstdDecompressor().stream_reader(file).read(n)` is
documented to potentially short-read when its underlying source
does -- for `SimpleUploadedFile`/`BytesIO` it doesn't in practice,
but a file-backed `TemporaryUploadedFile` could. Without a loop,
a bomb payload that happens to straddle a short-read boundary
could slip past the 8 MiB cap.

Fixed by looping until EOF or the cap is exceeded. Peak memory
stays bounded at `limit + 1` bytes.

Also pinned error-response bodies on the negative tests so a
future typo in the error string would be caught.
Cursor Bugbot caught: `zstandard.ZstdDecompressor().stream_reader(file)`
defaults to `read_across_frames=False`, so a client sending a
multi-frame zstd payload (concatenated frames) would have only the
first frame decoded -- rest silently discarded -- producing a
confusing downstream checksum mismatch instead of correct
decompression.

`GzipFile.read()` in `GzipChunk` already reads across concatenated
gzip members, so the two codecs now behave consistently. Matches
the convention established in
`src/sentry/models/eventattachment.py:179` which passes the same
flag for the same reason.

Added `test_upload_content_encoding_zstd_multi_frame` which
concatenates two independent zstd frames and asserts the decoded
payload is the concatenation of both plaintexts (30/30 tests).
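The behavior the zstd path gains with `read_across_frames=True` can be seen with stdlib gzip, whose `GzipFile.read()` already reads across concatenated members (zstandard itself is not used in this sketch):

```python
import gzip
import io

# Two independently compressed members, concatenated on the wire --
# analogous to a client sending concatenated zstd frames.
data = gzip.compress(b"frame-one") + gzip.compress(b"frame-two")
decoded = gzip.GzipFile(fileobj=io.BytesIO(data)).read()
assert decoded == b"frame-oneframe-two"  # both members decoded, nothing dropped
```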

@armenzg armenzg left a comment


The changes look good.
Feel free to ignore the logging suggestions. It's a pet peeve of mine.
My only question for you is about the option gate in the post method.

Per review feedback, use logger.warning (not info) for HTTP 400
"chunkupload.end" log lines in the paths touched by this PR so
they page/alert at the right level. Extends to the 2 pre-existing
sites that now live inside the new try/except block for
consistency. The 2 pre-existing sites OUTSIDE this PR's surface
(too-many-chunks, from_files OSError) are left on logger.info
unchanged to keep the diff focused.
@BYK BYK merged commit d4814b0 into master Apr 24, 2026
56 checks passed
@BYK BYK deleted the byk/chunk-upload-zstd branch April 24, 2026 09:37