
Fix ZipArchive Update removing data descriptors #126447

Open
bwinsley wants to merge 26 commits into dotnet:main from bwinsley:126344

Conversation

@bwinsley

@bwinsley bwinsley commented Apr 2, 2026

Fix ZipArchive Update mode corruption when entries have data descriptors (bit 3)

Fixes #126344

Description

When opening an existing ZIP in ZipArchiveMode.Update and disposing (which triggers the write-back), the offset calculations in WriteFileCalculateOffsets and the stream seeking in WriteLocalFileHeaderAndDataIfNeeded/WriteLocalFileHeaderAndDataIfNeededAsync did not account for the data descriptor bytes (12–24 bytes depending on format) that follow compressed data when general purpose bit flag bit 3 is set. This caused new or shifted entries to overwrite data descriptors, corrupting the archive.

This commonly affects archives created on non-seekable streams (which always set bit 3) or by external tools such as Java's ZipOutputStream, Azure blob storage SDKs, and similar.
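A minimal repro sketch of that setup (assumed shape with illustrative entry names — not the runtime's actual test code): a write-only wrapper hides seekability, which forces ZipArchive to emit data descriptors, and the archive is then reopened in Update mode:

```csharp
using System;
using System.IO;
using System.IO.Compression;

var backing = new MemoryStream();
string roundTripped;

// Create the archive through a non-seekable stream so every entry gets
// bit 3 set and a trailing data descriptor.
using (var zip = new ZipArchive(new NonSeekableStream(backing), ZipArchiveMode.Create, leaveOpen: true))
using (var writer = new StreamWriter(zip.CreateEntry("a.txt", CompressionLevel.NoCompression).Open()))
{
    writer.Write("hello");
}

backing.Position = 0;
using (var zip = new ZipArchive(backing, ZipArchiveMode.Update, leaveOpen: true))
{
    zip.CreateEntry("b.txt"); // before the fix, this write-back could clobber a.txt's data descriptor
}

backing.Position = 0;
using (var zip = new ZipArchive(backing, ZipArchiveMode.Read))
using (var reader = new StreamReader(zip.GetEntry("a.txt")!.Open()))
{
    roundTripped = reader.ReadToEnd();
}
Console.WriteLine(roundTripped); // "hello" once the archive survives the update

// Write-only wrapper that hides seekability from ZipArchive.
sealed class NonSeekableStream : Stream
{
    private readonly Stream _inner;
    public NonSeekableStream(Stream inner) => _inner = inner;
    public override bool CanSeek => false;
    public override bool CanRead => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() => _inner.Flush();
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
}
```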

This is a .NET 10 regression introduced by PR #102704's selective-rewrite optimization.

Changes

Offset calculation fix (ZipArchive.cs, ZipArchive.Async.cs)

  • ComputeEntryEndOffsets() — New method that precomputes EndOfLocalEntryData for each originally-in-archive entry in a single O(n) reverse pass. Since _entries is sorted by local header offset, each entry's end boundary is the next original entry's OffsetOfLocalHeader, or _centralDirectoryStart for the last entry. This naturally accounts for any trailing data (data descriptors, padding, etc.) without any stream I/O.
  • WriteFileCalculateOffsets — Now uses entry.EndOfLocalEntryData instead of GetOffsetOfCompressedData() + CompressedLength, which did not include data descriptor bytes.
  • Both WriteFile() and WriteFileAsync() call ComputeEntryEndOffsets() before processing entries.
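The boundary computation can be sketched stand-alone over plain offsets (the real code works on the internal _entries list of ZipArchiveEntry objects, so the array-based helper below is illustrative):

```csharp
using System;

// Entries are assumed sorted by local header offset, as _entries is.
static long[] ComputeEntryEndOffsets(long[] localHeaderOffsets, long centralDirectoryStart)
{
    var ends = new long[localHeaderOffsets.Length];
    for (int i = 0; i < localHeaderOffsets.Length; i++)
    {
        // An entry ends where the next one begins; the last entry ends at the
        // central directory. Data descriptors and padding fall inside the range.
        ends[i] = i < localHeaderOffsets.Length - 1
            ? localHeaderOffsets[i + 1]
            : centralDirectoryStart;
    }
    return ends;
}

long[] ends = ComputeEntryEndOffsets(new long[] { 0, 100, 250 }, 400);
Console.WriteLine(string.Join(", ", ends)); // 100, 250, 400
```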

Metadata-only seek fix (ZipArchiveEntry.cs, ZipArchiveEntry.Async.cs)

  • After writing the local file header for an unchanged/metadata-only entry, the stream now seeks to the absolute EndOfLocalEntryData position instead of using a relative seek by _compressedSize. This single seek correctly advances past both compressed data and any trailing data descriptor, with zero stream reads.
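A toy illustration of the difference between the two seeks, assuming a 16-byte data descriptor follows the compressed data (all sizes here are made up):

```csharp
using System;
using System.IO;

// Toy layout: [local header: 10 bytes][compressed data: 50 bytes][data descriptor: 16 bytes]
var ms = new MemoryStream(new byte[10 + 50 + 16]);
long compressedSize = 50;
long endOfLocalEntryData = 10 + 50 + 16; // the next entry's local header offset

ms.Position = 10; // just past the rewritten local header
long relative = ms.Seek(compressedSize, SeekOrigin.Current);   // old code: lands at 60, on the descriptor
long absolute = ms.Seek(endOfLocalEntryData, SeekOrigin.Begin); // fix: lands at 76, past the descriptor

Console.WriteLine($"{relative} vs {absolute}"); // 60 vs 76
```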

Bit 3 (data descriptor flag) handling

  • WriteLocalFileHeaderInitialize now clears bit 3 only when the header is actually being written (not when it's skipped for unchanged entries) and only for seekable/empty-file paths. This means unchanged entries that skip header writing preserve their original bit 3 flag, keeping the local header, central directory, and on-disk data descriptor consistent — no save/restore or seek-back patching needed.
  • For entries whose headers are rewritten (force-written after a rewrite point), bit 3 is correctly cleared since the sizes are written directly into the header. Any orphaned data descriptor bytes become harmless dead space — compliant readers ignore them when bit 3 is not set.
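A toy model of that rule (not the runtime's code; the parameter names stand in for the PR's forceWrite and ArchiveStream.CanSeek conditions):

```csharp
using System;

// Bit 3 (0x0008) of the general purpose bit flag marks a data descriptor.
const ushort DataDescriptorFlag = 0x0008;

ushort AdjustFlags(ushort flags, bool headerIsWritten, bool streamIsSeekable)
{
    // Clear bit 3 only when the header is actually written to a seekable
    // stream, so the real sizes go into the header; skipped headers keep
    // their original flag, consistent with the on-disk descriptor.
    if (headerIsWritten && streamIsSeekable)
        flags &= unchecked((ushort)~DataDescriptorFlag);
    return flags;
}

Console.WriteLine(AdjustFlags(0x0808, headerIsWritten: true, streamIsSeekable: true));  // 2048 (bit 3 cleared)
Console.WriteLine(AdjustFlags(0x0008, headerIsWritten: false, streamIsSeekable: true)); // 8 (preserved)
```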

Testing

Three new regression tests in zip_UpdateTests.cs, all parameterized on bool async via [Theory]/[MemberData] to exercise sync Dispose and async DisposeAsync code paths:

Update_DataDescriptorSignature_IsCorrectlyWrittenAndPreserved

Creates a 3-entry archive via a non-seekable stream (forces bit 3 / data descriptors) with NoCompression, reopens in Update mode, adds a new entry. Performs a structural binary walk of the updated archive: EOCD → central directory → local file headers, verifying bit 3 flags on original entries and data descriptor signatures at the correct computed byte offsets.
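A sketch of the signature check such a walk performs, using the well-known optional data descriptor signature PK\x07\x08 (0x08074B50 little-endian; the helper name is illustrative):

```csharp
using System;
using System.Buffers.Binary;

const uint DataDescriptorSignature = 0x08074B50; // "PK\x07\x08"

bool HasDescriptorSignature(ReadOnlySpan<byte> archive, int offset) =>
    BinaryPrimitives.ReadUInt32LittleEndian(archive.Slice(offset, 4)) == DataDescriptorSignature;

byte[] bytes = { 0x50, 0x4B, 0x07, 0x08, 0, 0, 0, 0 };
Console.WriteLine(HasDescriptorSignature(bytes, 0)); // True
```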

Update_DataDescriptorWithDeletedEntry_PreservesArchive

Creates a 5-entry archive via a non-seekable stream, reopens in Update mode, deletes a middle entry. Verifies the remaining 4 entries are readable with correct data. Exercises offset recalculation in ComputeEntryEndOffsets when entries with data descriptors are removed and subsequent entries must shift.

Update_DataDescriptorWithMetadataOnlyChange_PreservesArchive

Creates a 3-entry archive via a non-seekable stream, reopens in Update mode, changes LastWriteTime on the middle entry without opening its stream (exercises the metadata-only rewrite path), adds a new entry. Reopens in Read mode and verifies all original entries' data intact, the metadata change was preserved, and the new entry is correct.
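The metadata-only path can be sketched like this (entry names are illustrative, and the real test additionally creates the archive over a non-seekable stream and verifies entry contents):

```csharp
using System;
using System.IO;
using System.IO.Compression;

var ms = new MemoryStream();
int entryCount;

using (var zip = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
{
    zip.CreateEntry("first.txt");
    zip.CreateEntry("middle.txt");
    zip.CreateEntry("last.txt");
}

ms.Position = 0;
using (var zip = new ZipArchive(ms, ZipArchiveMode.Update, leaveOpen: true))
{
    // Metadata-only change: the entry stream is never opened, so only
    // the header is rewritten on dispose.
    zip.GetEntry("middle.txt")!.LastWriteTime = new DateTimeOffset(2026, 1, 1, 0, 0, 0, TimeSpan.Zero);
    zip.CreateEntry("added.txt");
}

ms.Position = 0;
using (var zip = new ZipArchive(ms, ZipArchiveMode.Read))
{
    entryCount = zip.Entries.Count;
}
Console.WriteLine(entryCount); // 4
```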

Test results

All 1739 tests pass with 0 errors, 0 failures (1733 existing + 6 new test runs).

Copilot AI review requested due to automatic review settings April 2, 2026 02:49
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 2, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds a regression test in System.IO.Compression to validate that ZIP archives created in “streaming” mode (non-seekable stream, data-descriptor bit set) remain structurally valid after reopening in ZipArchiveMode.Update and adding an entry.

Changes:

  • Added System.Buffers.Binary usage to inspect ZIP bytes.
  • Added a new [Theory] test that creates a streaming ZIP, validates data-descriptor markers, updates the archive by adding an entry, then validates readability/structure again.

Copilot AI review requested due to automatic review settings April 2, 2026 05:54
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Copilot AI review requested due to automatic review settings April 2, 2026 06:08
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

@bwinsley
Author

bwinsley commented Apr 2, 2026

@dotnet-policy-service agree company="Displayr"

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

@rzikm
Member

rzikm commented Apr 7, 2026

For entries whose headers ARE rewritten (metadata-only changes like LastWriteTime), this would leave bit 3 = 0 in the on-disk local header while the data descriptor bytes remain physically present — creating orphaned bytes a sequential parser cannot navigate.

One could argue that entries with the DataDescriptor bit aren't sequentially readable anyway: since the entry's size is not known up front (the size in the local header is 0), you don't know where the DataDescriptor is (scanning for the magic bytes is not reliable).

I agree that we should preserve the bits and the data descriptor though.

Comment on lines +369 to +371
ComputeEntryEndOffsets();

for (int i = 0; i < _entries.Count; i++)
Member


can we avoid iterating over the _entries twice?

Author


Unfortunately not without heavy refactoring: ComputeEntryEndOffsets() iterates backwards, while this loop on L371 iterates forwards and needs the values from ComputeEntryEndOffsets(). The combined work is still O(n), and ComputeEntryEndOffsets() is not a heavy or intensive function.

Member


I mean, you can do something like

    _entries[i].EndOfLocalEntryData = i < _entries.Count - 1
        ? _entries[i + 1].OffsetOfLocalHeader
        : _centralDirectoryStart;

Member


It might actually make sense to do this at the time of actually reading the central directory for the first time, that way we don't need to care about OriginallyInTheArchive

Author


Excellent pick up. I've moved it to where we read the central directory now

Copilot AI review requested due to automatic review settings April 7, 2026 13:26
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings April 8, 2026 04:18
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@bwinsley
Author

bwinsley commented Apr 8, 2026

@rzikm @alinpahontu2912 Have implemented suggestions where possible and left replies where not. Please let me know if anything is missing or needs updating!


entriesToWrite = new(_entries.Count);
foreach (ZipArchiveEntry entry in _entries)

Member


why change this ?

entriesToWrite = new(_entries.Count);

foreach (ZipArchiveEntry entry in _entries)
for (int i = 0; i < _entries.Count; i++)
Member


same question about changing this foreach loop


// return value is true if we allocated an extra field for 64 bit headers, un/compressed size
private async Task<bool> WriteLocalFileHeaderAsync(bool isEmptyFile, bool forceWrite, CancellationToken cancellationToken)
private async Task<bool> WriteLocalFileHeaderAsync(bool isEmptyFile, bool forceWrite, CancellationToken cancellationToken, bool preserveDataDescriptor = false)
Member


Usually CancellationToken is the last parameter — what do we think about this @rzikm ?

// directly into the header, so a data descriptor is not needed.
if (isEmptyFile || _archive.ArchiveStream.CanSeek)
{
if (preserveDataDescriptor && (_generalPurposeBitFlag & BitFlagValues.DataDescriptor) != 0)
Member


(_generalPurposeBitFlag & BitFlagValues.DataDescriptor) != 0 won't this always be true if preserveDataDescriptor is true ?


Labels

area-System.IO.Compression community-contribution Indicates that the PR has been added by a community member


Development

Successfully merging this pull request may close these issues.

ZipArchive Update mode corrupts archive when adding entries to zip with data descriptors (bit 3 set)

5 participants