
Fix TarReader handling of GNU sparse format 1.0 (PAX) — resolve GNU.sparse.name and GNU.sparse.realsize #125283

Draft
Copilot wants to merge 3 commits into main from
copilot/fix-gnu-sparse-format-handling

Conversation

Copilot AI commented Mar 6, 2026

TarReader was ignoring GNU.sparse.name and GNU.sparse.realsize PAX extended attributes, causing ~46% of entries from bsdtar-created archives (e.g., .NET SDK tarballs built on macOS/APFS) to expose internal placeholder paths like GNUSparseFile.0/real-file.dll and incorrect sizes.

Changes

TarHeader.cs

  • Add PaxEaGnuSparseName (GNU.sparse.name) and PaxEaGnuSparseRealSize (GNU.sparse.realsize) constants
  • Add _gnuSparseRealSize field, separate from _size (which drives archive data stream reading) to avoid corrupting stream positioning
  • Propagate _gnuSparseRealSize in the copy constructor used for format conversion

TarHeader.Read.cs: ReplaceNormalAttributesWithExtended()

  • After resolving path into _name, override it with GNU.sparse.name if present (this replaces the GNUSparseFile.0/… placeholder with the real path)
  • After resolving size into _size, capture GNU.sparse.realsize into _gnuSparseRealSize without touching _size
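
The ordering described in the two bullets above can be sketched as follows. This is only an illustration under stated assumptions: SparseHeaderSketch and its members are hypothetical stand-ins rather than the real TarHeader, and representing "not set" as a nullable long is a guess about the field, not something the PR text confirms.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Globalization;

// Hypothetical stand-in for the header state described above; not the real TarHeader.
public sealed class SparseHeaderSketch
{
    public string Name { get; private set; } = string.Empty;

    // Stored size: the number of data bytes actually present in the archive.
    public long Size { get; private set; }

    // Expanded size of the sparse file; null when the attribute is absent (assumed representation).
    public long? GnuSparseRealSize { get; private set; }

    public void ApplyExtendedAttributes(IReadOnlyDictionary<string, string> attributes)
    {
        // Standard PAX "path" is resolved first.
        if (attributes.TryGetValue("path", out string? path))
        {
            Name = path;
        }

        // GNU.sparse.name then replaces the GNUSparseFile.0/... placeholder.
        if (attributes.TryGetValue("GNU.sparse.name", out string? sparseName) && sparseName.Length > 0)
        {
            Name = sparseName;
        }

        // Standard PAX "size" still controls how much data is read from the stream.
        if (attributes.TryGetValue("size", out string? sizeText) && TryParseSize(sizeText, out long size))
        {
            Size = size;
        }

        // The real size is kept in a separate field so Size is never overwritten.
        if (attributes.TryGetValue("GNU.sparse.realsize", out string? realText) && TryParseSize(realText, out long realSize))
        {
            GnuSparseRealSize = realSize;
        }
    }

    // Accepts plain decimal digits only (no sign, no whitespace).
    private static bool TryParseSize(string text, out long value) =>
        long.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out value);
}
```

Keeping the two sizes apart is the point of the sketch: whatever advances the archive stream keeps using Size, while anything that reports the file's length to a caller can prefer GnuSparseRealSize.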

TarEntry.cs

  • Length returns _gnuSparseRealSize when set, otherwise falls back to existing behavior

// Before: entry.Name == "GNUSparseFile.0/dotnet.dll", entry.Length == 512 (stored sparse size)
// After:  entry.Name == "dotnet.dll",                  entry.Length == 1048576 (real file size)
using var reader = new TarReader(archiveStream);
TarEntry entry = reader.GetNextEntry();
Console.WriteLine(entry.Name);   // dotnet.dll
Console.WriteLine(entry.Length); // 1048576
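
On a runtime that predates this change, a caller could approximate the same result by reading the two attributes directly, much as the repro program in the linked issue does. The helper below is a minimal sketch of that idea; SparseEntryWorkaround and its method names are hypothetical, and only the public PaxTarEntry.ExtendedAttributes, TarEntry.Name, and TarEntry.Length members are relied on.

```csharp
#nullable enable
using System.Formats.Tar;
using System.Globalization;

// Hypothetical caller-side helper; not part of System.Formats.Tar.
public static class SparseEntryWorkaround
{
    // Prefers GNU.sparse.name when the entry is a PAX entry that carries it.
    public static string GetRealName(TarEntry entry) =>
        entry is PaxTarEntry pax
        && pax.ExtendedAttributes.TryGetValue("GNU.sparse.name", out string? name)
        && !string.IsNullOrEmpty(name)
            ? name
            : entry.Name;

    // Prefers GNU.sparse.realsize when present and a valid non-negative decimal number.
    public static long GetRealLength(TarEntry entry) =>
        entry is PaxTarEntry pax
        && pax.ExtendedAttributes.TryGetValue("GNU.sparse.realsize", out string? text)
        && long.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out long realSize)
            ? realSize
            : entry.Length;
}
```

Both helpers fall back to the entry's own Name and Length, so they behave the same for non-sparse entries and for non-PAX formats.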

Test: TarReader.GetNextEntry.Tests.cs

  • GnuSparse10Pax_NameAndLengthResolvedFromExtendedAttributes (both copyData variants): verifies resolved name, real size from GNU.sparse.realsize, and that DataStream still contains only the stored sparse bytes (confirming _size was not overridden)
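
One way to produce input for a test like this is to write a synthetic PAX entry with TarWriter and read it back. The sketch below is not the test added by the PR. SparsePaxFixture is a hypothetical helper, the entry carries only the two attributes discussed here, and its data is arbitrary bytes rather than a valid GNU sparse map, so it only imitates the headers of a bsdtar-created archive.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Formats.Tar;
using System.Globalization;
using System.IO;

// Hypothetical fixture helper; only imitates the headers of a sparse PAX entry.
public static class SparsePaxFixture
{
    // Writes one PAX entry whose name is the placeholder path and whose
    // extended attributes carry the real name and real size.
    public static MemoryStream Build(string realName, long realSize, byte[] storedBytes)
    {
        var attributes = new Dictionary<string, string>
        {
            ["GNU.sparse.name"] = realName,
            ["GNU.sparse.realsize"] = realSize.ToString(CultureInfo.InvariantCulture),
        };

        var archive = new MemoryStream();
        using (var writer = new TarWriter(archive, TarEntryFormat.Pax, leaveOpen: true))
        {
            var entry = new PaxTarEntry(TarEntryType.RegularFile, "GNUSparseFile.0/" + realName, attributes)
            {
                DataStream = new MemoryStream(storedBytes),
            };
            writer.WriteEntry(entry);
        }

        archive.Position = 0;
        return archive;
    }

    // Reads the first entry and returns its attributes and the length of the stored data.
    // The length is captured before the reader is disposed, because disposing the
    // reader also disposes the data streams of the entries it returned.
    public static (IReadOnlyDictionary<string, string> Attributes, long StoredLength) ReadFirst(Stream archive)
    {
        using var reader = new TarReader(archive, leaveOpen: true);
        var entry = (PaxTarEntry)reader.GetNextEntry(copyData: true)!;
        return (entry.ExtendedAttributes, entry.DataStream?.Length ?? 0);
    }
}
```

A test in the spirit of the one described above would then compare entry.Name and entry.Length against the attribute values, and compare the stored length against the number of bytes written, to confirm that the data stream is not sized from GNU.sparse.realsize.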

Original prompt

This section details the original issue you should resolve

<issue_title>TarReader doesn't handle GNU sparse format 1.0 (PAX) - exposes GNUSparseFile.0 placeholder paths</issue_title>
<issue_description>## Description

System.Formats.Tar.TarReader does not handle GNU sparse format 1.0 entries encoded via PAX extended attributes. When reading such entries, TarEntry.Name returns the internal placeholder path (containing GNUSparseFile.0) instead of the real file name, and TarEntry.Length returns the stored (sparse) size rather than the real file size.

GNU sparse format 1.0 stores the real name and size in PAX extended attributes:

  • GNU.sparse.name — the real file path
  • GNU.sparse.realsize — the real file size

TarHeader.ReplaceNormalAttributesWithExtended() processes standard PAX attributes like path, size, mtime, etc., but does not process GNU.sparse.name or GNU.sparse.realsize.

How this occurs in practice

macOS ships bsdtar (libarchive), which detects sparse files by default during archive creation. .NET DLLs on APFS have zero-filled PE alignment sections that APFS stores as filesystem holes, causing bsdtar to treat them as sparse and encode them with the GNU sparse PAX format.

The tar command producing the affected archive was:

tar -cf - . | pigz > output.tar.gz

When .NET's TarReader reads these archives, ~46% of entries have incorrect names containing GNUSparseFile.0.

Reproduction Steps

Option 1 — With an affected tar.gz file

Download an affected tarball (a .NET SDK built on macOS):
dotnet-sdk-11.0.100-ci-osx-x64.tar.gz

Then run the repro program (below) against it.

Option 2 — Create a sparse tar.gz on macOS

On a Mac, create a sparse file and archive it:

# Create a file with sparse holes
dd if=/dev/zero of=sparse.bin bs=1 count=0 seek=1048576
echo "hello" >> sparse.bin

# Archive it (bsdtar detects sparse by default)
tar -czf sparse.tar.gz sparse.bin

Then read it on any platform with the repro program below.

Repro Program

Program.cs:

using System.Formats.Tar;
using System.IO.Compression;

if (args.Length == 0)
{
    Console.Error.WriteLine("Usage: dotnet run -- <path-to-tarball.tar.gz>");
    return 1;
}

string path = args[0];
if (!File.Exists(path))
{
    Console.Error.WriteLine($"File not found: {path}");
    return 1;
}

Console.WriteLine($"Reading: {path}");
Console.WriteLine();

int totalEntries = 0;
int sparseEntries = 0;

using FileStream fs = File.OpenRead(path);
using GZipStream gz = new(fs, CompressionMode.Decompress);
using TarReader reader = new(gz);

while (reader.GetNextEntry() is TarEntry entry)
{
    totalEntries++;

    if (entry is PaxTarEntry pax
        && pax.ExtendedAttributes.TryGetValue("GNU.sparse.name", out string? realName))
    {
        sparseEntries++;

        if (sparseEntries <= 5)
        {
            Console.WriteLine($"Entry #{totalEntries}:");
            Console.WriteLine($"  entry.Name (WRONG): {entry.Name}");
            Console.WriteLine($"  GNU.sparse.name   : {realName}");

            if (pax.ExtendedAttributes.TryGetValue("GNU.sparse.realsize", out string? realSize))
            {
                Console.WriteLine($"  entry.Length       : {entry.Length}");
                Console.WriteLine($"  GNU.sparse.realsize: {realSize}");
            }
            Console.WriteLine();
        }
    }
}

Console.WriteLine($"Total entries : {totalEntries}");
Console.WriteLine($"Sparse entries: {sparseEntries}");

if (sparseEntries > 0)
{
    Console.WriteLine();
    Console.WriteLine("BUG: TarReader exposes internal 'GNUSparseFile.0' placeholder paths");
    Console.WriteLine("     instead of using the real name from GNU.sparse.name.");
}

return sparseEntries > 0 ? 1 : 0;

tar-repro.csproj:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net9.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>
</Project>

Expected behavior

For entries with GNU.sparse.name and GNU.sparse.realsize PAX extended attributes:

  • entry.Name should return the value of GNU.sparse.name (e.g., ./shared/Microsoft.NETCore.App/11.0.0-ci/Microsoft.CSharp.dll)
  • entry.Length should return the value of `GNU.sparse.r...

@dotnet-policy-service

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Copilot AI and others added 2 commits March 6, 2026 22:52
…rse.name and GNU.sparse.realsize

Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
…sertions for data stream integrity

Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix TarReader to handle GNU sparse format 1.0 correctly" to "Fix TarReader handling of GNU sparse format 1.0 (PAX) — resolve GNU.sparse.name and GNU.sparse.realsize" on Mar 6, 2026

Development

Successfully merging this pull request may close these issues.

TarReader doesn't handle GNU sparse format 1.0 (PAX) - exposes GNUSparseFile.0 placeholder paths

2 participants