Fix ZstandardStream truncating multi-frame zstd responses to the first frame#129047
Fix ZstandardStream truncating multi-frame zstd responses to the first frame#129047christosk92 wants to merge 3 commits into
Conversation
ZSTD_decompressStream returns 0 at the end of each frame, not the end of the stream, but ZstandardDecoder.Decompress treated that as end-of-stream via a permanent _finished latch. A zstd stream made of multiple concatenated frames (valid per RFC 8878) -- e.g. large HTTP Content-Encoding: zstd responses -- was therefore silently truncated to its first frame, surfacing downstream as truncated payloads (e.g. JSON parse errors at exactly 65536 bytes). ZstandardStream now continues decoding subsequent frames on the same native context (mirroring GZipStream multi-member handling), distinguishes a following frame from trailing data via the frame magic number, and only treats end-of-input as a clean end when at a frame boundary. Adds regression tests for concatenated frames (buffered, split across reads, and with trailing data). Fixes dotnet#129038.
|
Tagging subscribers to this area: @karelz, @dotnet/area-system-io-compression |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates Zstandard decompression to correctly handle multiple concatenated zstd frames (common in HTTP Content-Encoding: zstd) without truncation, and adds regression coverage for multi-frame and trailing-data scenarios.
Changes:
- Extend
ZstandardStreamdecompression to continue decoding across concatenated frames and distinguish frames vs. trailing non-zstd data. - Add tests to verify full output for concatenated frames (including across fragmented reads) and correct handling of trailing data after the final frame.
- Add decoder support method to clear managed end-of-frame state between frames.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/libraries/System.IO.Compression/tests/Zstandard/CompressionStreamUnitTests.Zstandard.cs | Adds regression tests for concatenated frames, across-read boundaries, and trailing-data handling. |
| src/libraries/System.IO.Compression/src/System/IO/Compression/Zstandard/ZstandardStream.Decompress.cs | Implements multi-frame decoding logic and adjusts truncation detection at frame boundaries. |
| src/libraries/System.IO.Compression/src/System/IO/Compression/Zstandard/ZstandardDecoder.cs | Adds PrepareForNextFrame to clear managed end-of-frame state for concatenated streams. |
| if (destination.IsEmpty) | ||
| { | ||
| // The caller provided a zero-byte buffer. This is typically done in order to avoid allocating/renting | ||
| // a buffer until data is known to be available. We don't have perfect knowledge here, as _decoder.Decompress | ||
| // will return DestinationTooSmall whether or not more data is required. As such, we assume that if there's | ||
| // any data in our input buffer, it would have been decompressible into at least one byte of output, and | ||
| // otherwise we need to do a read on the underlying stream. This isn't perfect, because having input data | ||
| // doesn't necessarily mean it'll decompress into at least one byte of output, but it's a reasonable approximation | ||
| // for the 99% case. If it's wrong, it just means that a caller using zero-byte reads as a way to delay | ||
| // getting a buffer to use for a subsequent call may end up getting one earlier than otherwise preferred. | ||
| Debug.Assert(status == OperationStatus.DestinationTooSmall); | ||
| if (_buffer.ActiveLength != 0) | ||
| { | ||
| Debug.Assert(bytesWritten == 0); | ||
| return true; | ||
| } | ||
| } |
There was a problem hiding this comment.
False positive. ZstandardDecoder.Decompress returns DestinationTooSmall (never NeedMoreData) whenever destination is empty, via an unconditional guard before any native call. So at this point the status is always DestinationTooSmall, including for 0-byte reads. This matches the pre-existing assert, so leaving it as-is.
| public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken) => | ||
| Task.FromResult(Read(buffer.AsSpan(offset, count))); | ||
|
|
||
| public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default) => | ||
| new ValueTask<int>(Read(buffer.Span)); |
There was a problem hiding this comment.
Fixed. Both ReadAsync overrides now return a canceled Task/ValueTask when the token is already canceled.
| // Fewer than ZstdFrameMagicLength bytes remain: not enough to tell whether another | ||
| // frame follows (its magic number may be split across reads) or this was the last | ||
| // frame. Hand back any output now and resolve on the next call / underlying read. | ||
| // Because we're at a frame boundary, end-of-input is treated as a clean end rather | ||
| // than truncation (see _atFrameBoundary checks in Read/ReadAsync). | ||
| lastResult = OperationStatus.NeedMoreData; | ||
| return bytesWritten != 0; |
There was a problem hiding this comment.
Good catch. Fixed the inconsistency too: at a frame boundary, end-of-input now reports Done so 1 to 3 trailing bytes are rewound on a seekable stream, matching the >= 4 byte case. Added a test for a frame followed by 1 to 3 trailing bytes.
- Report Done at end-of-input when at a frame boundary so 1-3 trailing bytes after the final frame are rewound on a seekable base stream, consistent with the >= 4 byte trailing-data case. - SingleByteReadStream.ReadAsync test helper now honors the CancellationToken. - Add regression test for a frame followed by 1-3 trailing bytes (seekable).
| if (destination.IsEmpty) | ||
| { | ||
| // The caller provided a zero-byte buffer. This is typically done in order to avoid allocating/renting | ||
| // a buffer until data is known to be available. We don't have perfect knowledge here, as _decoder.Decompress | ||
| // will return DestinationTooSmall whether or not more data is required. As such, we assume that if there's | ||
| // any data in our input buffer, it would have been decompressible into at least one byte of output, and | ||
| // otherwise we need to do a read on the underlying stream. This isn't perfect, because having input data | ||
| // doesn't necessarily mean it'll decompress into at least one byte of output, but it's a reasonable approximation | ||
| // for the 99% case. If it's wrong, it just means that a caller using zero-byte reads as a way to delay | ||
| // getting a buffer to use for a subsequent call may end up getting one earlier than otherwise preferred. | ||
| Debug.Assert(status == OperationStatus.DestinationTooSmall); |
| Assert.Equal(expected.Length, output.Length); | ||
| Assert.Equal(expected, output.ToArray()); |
|
@dotnet-policy-service agree |
The while(true) loop assigns lastResult before any use, so the pre-loop initializer was a dead store (IDE0059, treated as a build error). Also drop the redundant re-assignment of Done in the trailing-data branch.
Fixes #129038.
Problem
A zstd stream may be a sequence of frames concatenated back-to-back (RFC 8878 §3), and many encoders/CDNs emit one frame per buffer — so large
Content-Encoding: zstdHTTP responses commonly arrive as multiple ~64 KB frames.ZstandardStreamdecoded only the first frame and silently dropped the rest, surfacing downstream as truncated payloads (in the linked issue,System.Text.Jsonfailing at exactlyBytePositionInLine: 65536).Root cause is in
ZstandardDecoder.Decompress:ZSTD_decompressStreamreturning0means end-of-frame, not end-of-stream (lib/zstd.h: "0 when a frame is completely decoded and fully flushed")._finishedis a one-way latch andDecompressstarts withif (_finished) return OperationStatus.Done;, so once the first frame completes every subsequent frame is discarded.Fix
This mirrors the existing multi-member handling in
GZipStream/DeflateStream(Inflater.ResetStreamForLeftoverInput+DeflateStream.InflatorIsFinished), adapted to zstd:ZstandardStream.TryDecompressnow loops across frames. When the decoder reportsDoneand more input is available that begins with a zstd frame magic number, it continues decoding the next frame on the same native context. No reset is needed between frames —ZSTD_decompressStreamautomatically begins the next frame on the following call (so window-size / dictionary settings carry over), which keeps this consistent with the native streaming contract.ZstandardDecoder.PrepareForNextFrame()(internal) clears only the managed end-of-frame latch.StartsWithZstdFramepeeks the 4-byte frame magic (standard0xFD2FB528, or skippable0x184D2A50–0x184D2A5F) to distinguish a following frame from trailing non-zstd data after the final frame, which is left untouched — matching howDeflateStreamtreats data after the last gzip member._atFrameBoundaryflag makes end-of-input a clean end (rather than truncated data) only when the last frame finished cleanly, so the existing strict-validation truncation detection still fires for genuinely truncated frames.Only the streaming path was affected. The static one-shot helpers are already multi-frame-correct (
ZstandardDecoder.TryDecompress→ZSTD_decompress, which concatenates natively;TryGetMaxDecompressedLength→ZSTD_decompressBound, which is multi-frame-aware), so they are unchanged.Tests
Added to
CompressionStreamUnitTests.Zstandard.cs(sync + async each):ZstandardStream_ConcatenatedFrames_DecompressesAllFrames— two concatenated frames whose payloads span the 64 KB internal buffer; asserts the full concatenated output (this fails without the fix).ZstandardStream_ConcatenatedFrames_AcrossReads_DecompressesAllFrames— a non-seekable stream returning one byte per read, so each frame magic is discovered across multiple underlying reads (exercises the HttpClient-style no-rewind path and the magic-split-across-reads case).ZstandardStream_FrameFollowedByTrailingData_StopsAtEndOfFrame— a single frame followed by non-zstd trailing bytes; asserts the payload decodes and the trailing bytes remain available on the (seekable) base stream.Existing tests, including
StreamTruncation_IsDetected, should continue to pass.Notes
I haven't built the full runtime locally, so I'm relying on CI to validate. Happy to adjust the approach — in particular, whether the frame-to-frame continuation should live in
ZstandardStream(as here, parallelingDeflateStream) or insideZstandardDecoder, and whether trailing-data tolerance should matchGZipStreamexactly. A deterministic, dependency-free repro is linked from the issue: https://github.com/christosk92/zstd-net11-repro