Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tar APIs pending feedback to address #68230

Open
45 of 54 tasks
carlossanlop opened this issue Apr 19, 2022 · 14 comments
Open
45 of 54 tasks

Tar APIs pending feedback to address #68230

carlossanlop opened this issue Apr 19, 2022 · 14 comments
Assignees
Milestone

Comments

@carlossanlop
Copy link
Member

carlossanlop commented Apr 19, 2022

Initial PR changes were introduced here: #67883

The following is pending feedback I need to address (not urgent, for 8.0, mostly just additional testing coverage):

Perf improvements:

Additional test coverage:

  • Make sure that when archiving an executable, then extracting it, the executable mode bit gets properly preserved.
  • Add test that reads an archive containing an unsupported entry type (no writing).
  • Add test that ensures a hidden file can be used to create an entry from file.
  • Verify these GNU fields are written in the data stream: AllGnuUnused = Offset + LongNames + Unused + Sparse + IsExtended + RealSize.
  • Add test that ensures that a GNU archive (generated with tar tool) containing unused GNU bytes (sparse, etc) get preserved when written to another GNU archive.
  • Add test to extract entries to disk one by one.
  • Add test with entries containing relative segments that would cause extraction to happen outside of destination directory.
  • Add test that opens one file concurrently, to prevent Allow sharing of input tar file for read #68361 from happening again.
  • Add test that verifies that adding a Windows path with '' separators changes them to '/'.
  • Consider enabling CA1305: Implement Tar APIs #67883 (comment)
  • Add to the runtime-assets script an unarchived test with both a longlink and a longpath.
  • Find out how (if possible) to add a file as a hardlink, because otherwise, it can only be created directly as an entry, not by reading it from the filesystem.
  • Share WrappedStream in Common as suggested by @danmoseley .

Done:

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 19, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Apr 19, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

I received feedback in my first Tar PR #67883 that I would like to address in a follow-up PR.

  • Address chmod suggestions by @tmds .
  • Address stackalloc and stream-related suggestions by @stephentoub .
  • Implement UName and GName from the current process' gid and uid, maybe use GetNameFromUid and (new) GetNameFromGid.
  • Make sure I have full tests for individual extraction of all data types.
  • Add a symlink test that has a link with an absolute path to a target.
  • Make sure that when archiving an executable, then extracting it, the executable mode bit gets properly preserved.
  • Add test that reads an archive containing an unsupported entry type (no writing).
  • Add test that advances DataStream a bit before extracting into filesystem, verify data was written starting from that position.
  • Add test that ensures a hidden file can be used to create an entry from file.
  • Verify these GNU fields are written in the data stream: AllGnuUnused = Offset + LongNames + Unused + Sparse + IsExtended + RealSize.
  • Add test that ensures that a GNU archive (generated with tar tool) containing unused GNU bytes (sparse, etc) get preserved when written to another GNU archive.
  • Add simple tests to verify uncompressing a tar.gz works.
  • Add test to extract entries to disk one by one.
  • Verify extended attributes in all PAX tests.
  • Add Ustar checksum test.
  • Add Pax checksum test.
  • Add Gnu checksum test.
  • Find all the stream.Read calls and make sure we read all that was expected (use similar logic to what's being done in ReadOrThrow).
  • Investigate why retrieving the devmajor and devminor with syscalls are not matching the expected values.
  • Add to the runtime-assets script an unarchived test with both a longlink and a longpath.
  • Add test that reads an archive from a file with both a longlink and a longpath.
  • Address documentation suggestions, make sure all exceptions and remarks are properly filled out.
  • Remove src csproj nullable enable after @eerhardt merges his PR.
  • Find out how (if possible) to add a file as a hardlink, because otherwise, it can only be created directly as an entry, not by reading it from the filesystem.
  • Think of a better way to verify DateTimeOffsets in tests.
  • Implement async TarFile APIs.
  • Implement async TarEntry APIs.
  • Implement async TarWriter APIs.
  • Implement async TarReader APIs.
  • Share WrappedStream in Common as suggested by @danmoseley .
  • Discuss if the TarFile extraction methods should extract something and throw at the end.
  • Discuss if we want to merge DevMajor and DevMinor into a single public API.
Author: carlossanlop
Assignees: carlossanlop
Labels:

area-System.IO, untriaged

Milestone: -

@carlossanlop carlossanlop removed the untriaged New issue has not been triaged by the area owner label Apr 19, 2022
@carlossanlop carlossanlop added this to the 7.0.0 milestone May 10, 2022
@Bio2hazard
Copy link

On .NET 7 Preview 4,
I've been testing the Tar APIs on some tar.gz files created on our teamcity agents (in ustar format) and ran into 2 issues:

  • When using TarReader over a unseekable GZipStream it manages to read the first entry, but fails on the second one with an EndOfStreamException emitted in TarHelpers.cs line 40.
  • When using TarReader over a seekable MemoryStream, it fails when it encounters a long filename that the tar format splits into a separate header. It detects it as a GNU format instead of ustar and fails with Entry 'blah' was expected to be in the GNU format, but did not have the expected version data.

Hope that helps.

@carlossanlop
Copy link
Member Author

Thank you for your report, @Bio2hazard ! Any chance you can share a sample tar for each one of your experiments?

@Bio2hazard
Copy link

@carlossanlop yes I was able to reproduce it without sensitive information, thankfully. Here it is:
testarchive.tar.gz

It has 3 files in it:

  • testfile1.txt which should have 5 paragraphs of lorem ipsum
  • testfile3.txt which has the same content as testfile1.txt
  • testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-correctly.txt which just contains its own filename

I was trying to benchmark it, so I'll just drop the benchmarking code in here. Goal is just to get the content byte[] for every file in the archive.

using System.Formats.Tar;
using System.IO.Compression;
using BenchmarkDotNet.Attributes;

namespace Benchmarks7;

[GcServer(true)]
[GcForce(true)]
[MemoryDiagnoser]
public class ReadFromTarGz
{
    private static byte[] fileData;

    public ReadFromTarGz()
    {
	    fileData = File.ReadAllBytes(@"testarchive.tar.gz");
    }

    [Benchmark]
    public List<byte[]> WithSharpCompress()
    {
        var ret = new List<byte[]>();
        using (var fileStream = new MemoryStream(fileData))
        using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
        using (var tarReader = SharpCompress.Readers.Tar.TarReader.Open(stream))
        {
    	    while (tarReader.MoveToNextEntry())
    	    {
    		    var entry = tarReader.Entry;
                if (entry.IsDirectory)
    		    {
    			    continue;
    		    }

    		    byte[] entryData = new byte[entry.Size];
    		    using (var fileMemoryStream = new MemoryStream(entryData))
    		    using (var entryStream = tarReader.OpenEntryStream())
    		    {
    			    entryStream.CopyTo(fileMemoryStream);
    		    }
                ret.Add(entryData);
    	    }
        }

        return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_PreDecompressed()
    {
	    var ret = new List<byte[]>();
	    using var decompressedStream = new MemoryStream();

	    using (var fileStream = new MemoryStream(fileData))
	    using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
	    {
		    stream.CopyTo(decompressedStream);
	    }

	    decompressedStream.Position = 0;

	    using (var tarReader = new System.Formats.Tar.TarReader(decompressedStream))
	    {
		    System.Formats.Tar.TarEntry entry;
		    while ((entry = tarReader.GetNextEntry()) != null)
		    {
			    if (entry.EntryType == TarEntryType.Directory)
			    {
				    continue;
			    }

			    byte[] entryData = new byte[entry.Length];
			    using (var fileMemoryStream = new MemoryStream(entryData))
			    {
				    entry.DataStream.CopyTo(fileMemoryStream);
			    }
			    ret.Add(entryData);
		    }
	    }

	    return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_StreamedDecompression()
    {
	    var ret = new List<byte[]>();
	    using (var fileStream = new MemoryStream(fileData))
	    using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
	    using (var tarReader = new System.Formats.Tar.TarReader(stream))
	    {
		    System.Formats.Tar.TarEntry entry;
		    while ((entry = tarReader.GetNextEntry()) != null)
		    {
			    if (entry.EntryType == TarEntryType.Directory)
			    {
				    continue;
			    }

			    byte[] entryData = new byte[entry.Length];
			    using (var fileMemoryStream = new MemoryStream(entryData))
			    {
				    entry.DataStream.CopyTo(fileMemoryStream);
			    }
			    ret.Add(entryData);
		    }
	    }

	    return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_StreamedDecompression_WithCopyData()
    {
	    var ret = new List<byte[]>();
	    using (var fileStream = new MemoryStream(fileData))
	    using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
	    using (var tarReader = new System.Formats.Tar.TarReader(stream))
	    {
		    System.Formats.Tar.TarEntry entry;
		    while ((entry = tarReader.GetNextEntry(true)) != null)
		    {
			    if (entry.EntryType == TarEntryType.Directory)
			    {
				    continue;
			    }

			    byte[] entryData = new byte[entry.Length];
			    using (var fileMemoryStream = new MemoryStream(entryData))
			    {
				    entry.DataStream.CopyTo(fileMemoryStream);
			    }
			    ret.Add(entryData);
		    }
	    }

	    return ret;
    }
}
  • WithSharpCompress works as expected
  • WithSystemsFormatTar_PreDecompressed fails with Exception (which I'll paste at the bottom of the message)
  • WithSystemsFormatTar_StreamedDecompression appears to work, but only contains a single result because of the aforementioned EndOfStreamException which gets caught internally, so it never gets to the long filename file
  • WithSystemsFormatTar_StreamedDecompression_WithCopyData throws the same exception as WithSystemsFormatTar_PreDecompressed

Exception:

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
 ---> System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Benchmarks7.ReadFromTarGz.WithSystemsFormatTar_PreDecompressed() in C:\Development\Benchmarks7\ReadFromTarGz.cs:line 65

I hope that helps!

@carlossanlop
Copy link
Member Author

carlossanlop commented May 19, 2022

Hi again @Bio2hazard ,

What tool did you use to generate the tar file inside your gz file?

I inspected the internal tar file with the Hex Editor HxD. The magic and version metadata fields of the first 2 entries indicate the archive is in ustar format, but then it contains an unexpected GNU format entry with the TarType LongLink in the 3rd position (and then a 4th entry containing the actual data, because the LongLink entry is a metadata entry). Hence why you're getting the FormatException thrown: the reader detects that the archive is malformed for containing mixed format entries, and it is expected that the exception is thrown when attempting to read the 3rd one.

Here is what the spec has to say about the magic and version fields depending on the archive format:

For POSIX archives (ustar and pax)

     magic   Contains the magic	value "ustar" followed by a NUL	byte to	indi-
	     cate that this is a POSIX standard	archive.  Full compliance re-
	     quires the	uname and gname	fields be properly set.

     version
	     Version.  This should be "00" (two	copies of the ASCII digit
	     zero) for POSIX standard archives.

And then a few lines below it says:

     Field termination is specified slightly differently by POSIX than by pre-
     vious implementations.  The magic,	uname, and gname fields	must have a
     trailing NUL. 

For GNU

 magic   The magic field holds the five characters "ustar" followed	by a
	     space.  Note that POSIX ustar archives have a trailing null.
version
	     The version field holds a space character followed	by a null.
	     Note that POSIX ustar archives use	two copies of the ASCII	digit
	     "0".

In other words:

  • For POSIX format archives (ustar or pax), the magic + version fields, which are one after the other, should look like: ustar\000.
  • In the GNU format, should look like: ustar \0.

I'm surprised that SharpCompress and 7-zip are able to open this malformed archive. This tells me that they are flexible when it comes to mixed format entries. The spec does not explicitly indicate the entries should all be in the same format, but I interpreted it this way considering there are metadata entries like L, K (GNU) or x, g (PAX) which precede other entries. Also, the Unix tar CLI tool does not mix format entries, and fails to write entries with long paths and long entries if you create an archive in v7 or ustar, because those formats do not support that.

Based on the above, I see one public API structure change we should make: If archives are allowed to have mixed entries as we are seeing in your example (or in other words, should not expect all entries to be of the same format), then we would have to remove the public Format property from TarReader, and we would have to move it to the base TarEntry class. That way, the user will have to check this property from each individual entry so they know to which format they should cast the returned entry.

Note: The TarWriter receives a Format in the constructor, and this is important, particularly when creating a PAX archive, because the user should be able to add global extended attributes which only show up at the beginning of the archive. So the constructor's format argument should stay. But if the user creates a V7 or a Ustar formatted archive with the writer, and suddenly decides to write an unsupported entry like a GnuTarEntry or a PaxTarEntry with a long path or a long link, we should still throw (current behavior), to prevent the user from intermixing formats, exactly like the Unix tar CLI tool does.

@carlossanlop
Copy link
Member Author

@Bio2hazard I'm really curious how you generated your archive, because even if I address the above by allowing intermixing formats, the magic and version of your entries are malformed in the 3rd entry: they indicate ustar instead of in gnu format, even though the entry type is L which is exclusive to the gnu format. And once again, I'm very surprised that SharpCompress and 7-zip were kind enough to allow it 😄 .

@Bio2hazard
Copy link

Bio2hazard commented May 19, 2022

@carlossanlop

Yikes, I was unaware that the archive is in a malformed format.

So the original file that I first encountered this problem with was created by TeamCity Enterprise. A build agent designated a folder as the output artifact, and TeamCity automatically turned the folder into a .tar.gz. I think these are the relevant docs in particular the target_archive part.

These are the relevant build logs:

[10:46:02] Publishing artifacts (13s)
[10:46:02] [Publishing artifacts] Collecting files to publish: [+:project/dist/dev=>project-dev-12345-67890-abcdefg.tar.gz]
[10:46:02] [Publishing artifacts] Creating archive project-dev-12345-67890-abcdefg.tar.gz (8s)
[10:46:02] [Creating archive project-dev-12345-67890-abcdefg.tar.gz] Creating /opt/jetbrains/TeamCity_agent/temp/buildTmp/TarPreprocessor1234567890123456789/project-dev-12345-67890-abcdefg.tar.gz

Of course, I can't exactly share our internal build artifacts, so what I did was:

  • Downloaded the build artifact onto my windows 10 machine (and ran into the reported errors when trying to benchmark)
  • Opened (not extracted) the .tar.gz archive with the 7z GUI
  • Navigated into the .tar file
  • Through the 7z GUI I modified the contents of the tar by adding the 3 test files, and deleting all the proprietary build artifacts
  • Then closed the 7z GUI window, which prompts 7z to ask if you want to persist the changes, which I accepted

image

My hope here was that modifying the file in this manner would cause 7z to use the same tar format/style as the original archive produced by TeamCity.

While I understand that that is a fairly ... unorthodox way of creating an archive, I am actually fairly confident it is a faithful reproduction. Debugging the actual TeamCity archive and the supplied test archive shows that they both fail in the same way with the same stack traces.

Furthermore, I had actually as a proof of concept created a very rudimentary manual tar extractor based on the GNU spec and it also got tripped up and failed at the long link part 😄

TeamCity is a big player in the CI/CD space, and the artifact publisher that produced this artifact is built-in and native. Given that I could not find any notable user discussions around problems with the artifact archives produced by TeamCity, I have to assume that it is very widely supported despite being malformed. Otherwise I am sure there would be plenty of discussions about it.

Please let me know if you need anything else. If you are concerned that the reproduction archive isn't accurate enough I can provide the header bytes of the original archive.

Edit: To further make sure the reproduction is accurate:

  • I took the original build artifact from teamcity
  • made a copy of it
  • opened the copy with the 7z GUI, removed a file and saved it
  • opened the copy with the 7z GUI a second time, re-added the file and saved it
  • compared all 512-byte tar headers between the original and the 7z modified copy: They all matched

@carlossanlop
Copy link
Member Author

Thanks for the information, @Bio2hazard . A few more questions:

So the original file that I first encountered this problem with was created by TeamCity Enterprise

Do you happen to know what was the original format of the archive? I couldn't find it in the documentation you shared. The reason I ask is because you then mentioned:

Through the 7z GUI I modified the contents of the tar by adding the 3 test files, and deleting all the proprietary build artifacts

Which I interpret as: you deleted all the original files that the TeamCity tool generated, then added new files using 7-zip. So it's 7-zip who's adding the (seemingly) malformed files, not TeamCity.

Furthermore, I had actually as a proof of concept created a very rudimentary manual tar extractor based on the GNU spec and it also got tripped up and failed at the long link part

Do you mean you tried reading the entries using the new TarReader API, or did you use some other tools? If you used TarReader and it failed, it was most likely because of the malformed magic and version. I should make it more flexible and avoid failing if the magic is not well formed, see if that helps with your case.

Given that I could not find any notable user discussions around problems with the artifact archives produced by TeamCity, I have to assume that it is very widely supported despite being malformed.

I don't think the archive created with the TeamCity tool is generating malformed entries. I think it was 7-zip, since you did not preserve any of the original artifacts.

  • opened the copy with the 7z GUI, removed a file and saved it
  • opened the copy with the 7z GUI a second time, re-added the file and saved it
  • compared all 512-byte tar headers between the original and the 7z modified copy: They all matched

These 3 steps are confusing: Are you saying that some of the files in the .tar.gz file you shared are the original ones that were added by the TeamCity tool? If yes, which one was not removed?

If all the files were replaced using 7-zip, then 7-zip is making a bad job at detecting the format of existing entries, and adding entries in whatever format it wants.

Still, if you're saying 7-zip and SharpCompress both support intermixed format entries in a single archive, then we should support them as well. So I'll start working on that.

@Bio2hazard
Copy link

Bio2hazard commented May 20, 2022

Sorry, you may be right. I'll try to clarify.

I created a LINQPad Script (gist) to read the headers of the files in the tar archive and compare the headers between 2 archives.

I was trying to determine if 7z changes the tar headers when files are added through the 7z GUI, as you are suspecting.

I extracted and untar'd the original archive as produced by teamcity, then made a copy of it and through the 7z GUI deleted, and then re-added a single file (from the extracted original archive). In that test, the header did not show up as changed by 7z, but I'm not convinced it was working right.

I performed the experiment again today, this time removing and re-adding every file, and now the headers do show up differently, so 7z is modifying the tar headers.

At a glance, I can tell that some differences are that user / group are removed and that the file modes are different.

So for comparison, here is the Base64 of a tar header as created directly by TeamCity:
YXNzZXRzLzAtNjA4ZWNhOWJiMWRlMjE4ZjFiZTUucG5nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA2NDQgMDAzNTIzMCAwMDM1MjMwIDAwMDAwMTMzMzEwIDE0MjI1MDY1MzYwIDAyMDM0MwAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMHRlYW1jaXR5AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAdGVhbWNpdHkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And the Base64 of the tar header for the same file as created by 7z:
YXNzZXRzLzAtNjA4ZWNhOWJiMWRlMjE4ZjFiZTUucG5nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMTMzMzEwADE0MjI1MDY1MzYwADAxMzIyMAAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=


Here is the Base64 of a long link tar header as created directly by TeamCity:
Li8uL0BMb25nTGluawAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA2NDQgMDAwMDAwMCAwMDAwMDAwIDAwMDAwMDAwMTY0IDE0MjI1MDY1MzYwIDAxMTYzNwAgTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhciAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And the corresponding Base64 of a long link tar header as created by 7z:
Li8uL0BMb25nTGluawAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMDAwMTY0ADE0MjI1MDY1MzYwADAxMDEwNgAgTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=


Finally, here is the Base64 of a long file name entry as created directly by TeamCity:
YXNzZXRzL2ZlYXR1cmVzLmNoYW5uZWwtaGVhZGVyLmNvbXBvbmVudHMuc29jaWFsLWJ1dHRvbnMuY29tcG9uZW50cy5mb2xsb3dlci1lbW90ZXMtcHJvbXB0LTRlNzZjZmYyNjAxMDA2NDQgMDAzNTIzMCAwMDM1MjMwIDAwMDAwMDEzMzQwIDE0MjI1MDY1MzYwIDAzNjMzMwAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMHRlYW1jaXR5AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAdGVhbWNpdHkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And here is the corresponding Base64 of a long file name entry as created by 7z:
YXNzZXRzL2ZlYXR1cmVzLmNoYW5uZWwtaGVhZGVyLmNvbXBvbmVudHMuc29jaWFsLWJ1dHRvbnMuY29tcG9uZW50cy5mb2xsb3dlci1lbW90ZXMtcHJvbXB0LTRlNzZjZmYyADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMDEzMzQwADE0MjI1MDY1MzYwADAzMTEyMgAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

I think that should cover all cases as far as header differences goes.

It bears repeating that the original, completely unmodified, downloaded straight from teamcity .tar.gz archive failed to process with the new TarReader in the same manner as the test file I provided here. So even if the header of the teamcity entry is not malformed, it still failed in the same way.

Hope that helps and clears things up. I'm an SDE, not DevOps so I didn't have the option of producing a sharable/sanitized archive directly in teamcity.

@carlossanlop
Copy link
Member Author

Thanks for spending time helping with this, @Bio2hazard. I appreciate it a lot.

I performed the experiment again today, this time removing and re-adding every file, and now the headers do show up differently, so 7z is modifying the tar headers.

Ah good to know, my suspicion was correct.

At a glance, I can tell that some differences are that user / group are removed and that the file modes are different.

This might be because you were doing this on a Windows machine. The concept of mode, uname and gname are from Unix. This is ok.

So even if the header of the teamcity entry is not malformed, it still failed in the same way.

This is interesting, and I think it's the only thing left for us to clarify, if that's ok. Do you mind sharing the exception message/callstack when using TarReader with the original TeamCity archive? I think the exception messages you shared in your previous comments came from the modified file, which contained entries that were added by 7-zip, not TeamCity. Feel free to obfuscate any sensitive information if needed.

Actually, if you have time and the chance to help a bit more: if you can download HxD Hex Editor and open the .tar file generated with TeamCity (not the .tar.gz file) you can see the raw bytes of the archive. The metadata is always located in the first 512 bytes of each entry. So for example, in the archive you shared that was modified with 7-zip, the magic and version fields of the first entry can be located in the blocks 257-264. If you scroll down, the magic and version fields of the LongLink entry are located in the blocks 9473-9480.

Can you share the magic and version fields of the LongLink entry in the TeamCity archive? This would help us determine if it's malformed or not.

Here's what I get from the first entry:

Here are the magic and version of the LongLink that 7-zip modified, they are malformed - Notice that it the value is ustar\000, which is the posix format, instead of ustar \0, which is the GNU format expected for a LongLink entry:

And again, thank you so much for testing this and for your patience.

@Bio2hazard
Copy link

@carlossanlop I'll get those stack traces for you

Can you share the magic and version fields of the LongLink entry in the TeamCity archive?

I wanted to quickly add that you should be able to just base64 decode the headers in my last message. They should have exactly what you are looking for. It's the first 512 bytes of the entries, from both teamcity and 7z.

@carlossanlop
Copy link
Member Author

carlossanlop commented May 20, 2022

I wanted to quickly add that you should be able to just base64 decode the headers in my last message.

Oh cool, never done that. Allow me to figure it out then, and will come back to you if I don't find the info I wanted.

EDIT: this is AWESOME! https://www.base64decode.org/

@Bio2hazard
Copy link

@carlossanlop nice looks like it worked out! Base64 is just a quick and convenient way to share byte[]s through a text format. You can use Convert.FromBase64String("<base 64 header from before>") to get the raw byte[] in C# if that helps.

I got the stack traces for you:

WithSystemsFormatTar_PreDecompressed()

In this variant, the .tar.gz is first decompressed via GZipStream into a seekable MemoryStream, which is passed into TarReader

TeamCity Archive:

System.FormatException: Entry 'assets/features.channel-header.components.social-buttons.components.follower-emotes-prompt-4e76cff26' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_PreDecompressed|0_0(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 27
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Test Archive:

System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_PreDecompressed|0_0(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 27
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Stack Traces are identical


WithSystemsFormatTar_StreamedDecompression()

In this variant, the TarReader reads directly from the nonseekable GZipStream.
No exception gets thrown here, but only a single entry is returned. The stack trace represents the part where it failed when processing the second file.

TeamCity Archive:

TarHelpers.IsAllNullBytes()at TarHelpers.cs:line 106
TarHeader.TryReadCommonAttributes()at TarHeader.Read.cs:line 309
TarHeader.TryGetNextHeader()
TarReader.TryGetNextEntryHeader()
TarReader.GetNextEntry()
Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression|0_1()
Program.<Main>$()

Test Archive:

TarHelpers.IsAllNullBytes()at TarHelpers.cs:line 106
TarHeader.TryReadCommonAttributes()at TarHeader.Read.cs:line 309
TarHeader.TryGetNextHeader()
TarReader.TryGetNextEntryHeader()
TarReader.GetNextEntry()
Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression|0_1()
Program.<Main>$()

The stack traces are again identical. This is the relevant failure point in TryReadCommonAttributes

// Empty checksum means this is an invalid (all blank) entry, finish early
Span<byte> spanChecksum = buffer.Slice(FieldLocations.Checksum, FieldLengths.Checksum);
if (TarHelpers.IsAllNullBytes(spanChecksum))
{
    return false;
}

The debugger showed all bytes of the checksum being null, which bubbles up and causes GetNextEntry to return false.


WithSystemsFormatTar_StreamedDecompression_WithCopyData()

In this variant, the TarReader reads directly from the nonseekable GZipStream just like the previous one. The only difference is that this uses tarReader.GetNextEntry(true) when reading entries.

TeamCity Archive:

System.IO.EndOfStreamException: Attempted to read past the end of the stream.
   at System.Formats.Tar.TarHelpers.CopyBytes(Stream origin, Stream destination, Int64 bytesToCopy)
   at System.Formats.Tar.TarHeader.GetDataStream(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarHeader.ProcessDataBlock(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression_WithCopyData|0_2(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 81
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Test Archive:

System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression_WithCopyData|0_2(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 81
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

This is the only test case where the result was different between the two. I hope that helps!

carlossanlop added a commit that referenced this issue Jun 2, 2022
* Address some System.Formats.Tar TODOs (infra and syscalls) (#69107)

Re-submitting the changes approved in PR: #69107

Addresses: #68230

Includes an extra change to prevent the android build failure with the addition of the new native call to getgrgid_r.

Co-authored-by: carlossanlop <carlossanlop@users.noreply.github.com>
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
@adamsitnik adamsitnik modified the milestones: 7.0.0, 8.0.0 Aug 12, 2022
@carlossanlop carlossanlop modified the milestones: 8.0.0, 9.0.0 Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants