Tar APIs pending feedback to address #68230

carlossanlop · 2022-04-19T20:07:26Z

Initial PR changes were introduced here: #67883

The following is pending feedback I need to address (not urgent, for 8.0, mostly just additional testing coverage):

Perf improvements:

Address stackalloc and stream-related suggestions by @stephentoub .
Improve code that retrieves the extended attributes from the data section. Use IndexOf and related APIs suggested by Stephen.
After IndexOfAnyExcept gets merged, consider using it: Add initial version of {Last}IndexOfAnyExcept #67941
Improve the code from Archiving.Utils.Windows.cs Implement Tar APIs #67883 (comment)

Additional test coverage:

Done:

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2022-04-19T20:07:29Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2022-04-19T20:11:46Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

I received feedback in my first Tar PR #67883 that I would like to address in a follow-up PR.

Author:	carlossanlop
Assignees:	carlossanlop
Labels:	`area-System.IO`, `untriaged`
Milestone:	-

Bio2hazard · 2022-05-17T00:23:34Z

On .NET 7 Preview 4,
I've been testing the Tar APIs on some tar.gz files created on our teamcity agents (in ustar format) and ran into 2 issues:

When using TarReader over a unseekable GZipStream it manages to read the first entry, but fails on the second one with an EndOfStreamException emitted in TarHelpers.cs line 40.
When using TarReader over a seekable MemoryStream, it fails when it encounters a long filename that the tar format splits into a separate header. It detects it as a GNU format instead of ustar and fails with Entry 'blah' was expected to be in the GNU format, but did not have the expected version data.

Hope that helps.

carlossanlop · 2022-05-17T04:17:26Z

Thank you for your report, @Bio2hazard ! Any chance you can share a sample tar for each one of your experiments?

Bio2hazard · 2022-05-17T18:44:57Z

@carlossanlop yes I was able to reproduce it without sensitive information, thankfully. Here it is:
testarchive.tar.gz

It has 3 files in it:

testfile1.txt which should have 5 paragraphs of lorem ipsum
testfile3.txt which has the same content as testfile1.txt
testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-correctly.txt which just contains its own filename

I was trying to benchmark it, so I'll just drop the benchmarking code in here. Goal is just to get the content byte[] for every file in the archive.

using System.Formats.Tar;
using System.IO.Compression;
using BenchmarkDotNet.Attributes;

namespace Benchmarks7;

[GcServer(true)]
[GcForce(true)]
[MemoryDiagnoser]
public class ReadFromTarGz
{
    private static byte[] fileData;

    public ReadFromTarGz()
    {
	    fileData = File.ReadAllBytes(@"testarchive.tar.gz");
    }

    [Benchmark]
    public List<byte[]> WithSharpCompress()
    {
        var ret = new List<byte[]>();
        using (var fileStream = new MemoryStream(fileData))
        using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
        using (var tarReader = SharpCompress.Readers.Tar.TarReader.Open(stream))
        {
    	    while (tarReader.MoveToNextEntry())
    	    {
    		    var entry = tarReader.Entry;
                if (entry.IsDirectory)
    		    {
    			    continue;
    		    }

    		    byte[] entryData = new byte[entry.Size];
    		    using (var fileMemoryStream = new MemoryStream(entryData))
    		    using (var entryStream = tarReader.OpenEntryStream())
    		    {
    			    entryStream.CopyTo(fileMemoryStream);
    		    }
                ret.Add(entryData);
    	    }
        }

        return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_PreDecompressed()
    {
	    var ret = new List<byte[]>();
	    using var decompressedStream = new MemoryStream();

	    using (var fileStream = new MemoryStream(fileData))
	    using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
	    {
		    stream.CopyTo(decompressedStream);
	    }

	    decompressedStream.Position = 0;

	    using (var tarReader = new System.Formats.Tar.TarReader(decompressedStream))
	    {
		    System.Formats.Tar.TarEntry entry;
		    while ((entry = tarReader.GetNextEntry()) != null)
		    {
			    if (entry.EntryType == TarEntryType.Directory)
			    {
				    continue;
			    }

			    byte[] entryData = new byte[entry.Length];
			    using (var fileMemoryStream = new MemoryStream(entryData))
			    {
				    entry.DataStream.CopyTo(fileMemoryStream);
			    }
			    ret.Add(entryData);
		    }
	    }

	    return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_StreamedDecompression()
    {
	    var ret = new List<byte[]>();
	    using (var fileStream = new MemoryStream(fileData))
	    using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
	    using (var tarReader = new System.Formats.Tar.TarReader(stream))
	    {
		    System.Formats.Tar.TarEntry entry;
		    while ((entry = tarReader.GetNextEntry()) != null)
		    {
			    if (entry.EntryType == TarEntryType.Directory)
			    {
				    continue;
			    }

			    byte[] entryData = new byte[entry.Length];
			    using (var fileMemoryStream = new MemoryStream(entryData))
			    {
				    entry.DataStream.CopyTo(fileMemoryStream);
			    }
			    ret.Add(entryData);
		    }
	    }

	    return ret;
    }

    [Benchmark]
    public List<byte[]> WithSystemsFormatTar_StreamedDecompression_WithCopyData()
    {
	    var ret = new List<byte[]>();
	    using (var fileStream = new MemoryStream(fileData))
	    using (var stream = new System.IO.Compression.GZipStream(fileStream, CompressionMode.Decompress))
	    using (var tarReader = new System.Formats.Tar.TarReader(stream))
	    {
		    System.Formats.Tar.TarEntry entry;
		    while ((entry = tarReader.GetNextEntry(true)) != null)
		    {
			    if (entry.EntryType == TarEntryType.Directory)
			    {
				    continue;
			    }

			    byte[] entryData = new byte[entry.Length];
			    using (var fileMemoryStream = new MemoryStream(entryData))
			    {
				    entry.DataStream.CopyTo(fileMemoryStream);
			    }
			    ret.Add(entryData);
		    }
	    }

	    return ret;
    }
}

WithSharpCompress works as expected
WithSystemsFormatTar_PreDecompressed fails with Exception (which I'll paste at the bottom of the message)
WithSystemsFormatTar_StreamedDecompression appears to work, but only contains a single result because of the aforementioned EndOfStreamException which gets caught internally, so it never gets to the long filename file
WithSystemsFormatTar_StreamedDecompression_WithCopyData throws the same exception as WithSystemsFormatTar_PreDecompressed

Exception:

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
 ---> System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Benchmarks7.ReadFromTarGz.WithSystemsFormatTar_PreDecompressed() in C:\Development\Benchmarks7\ReadFromTarGz.cs:line 65

I hope that helps!

carlossanlop · 2022-05-19T07:09:21Z

Hi again @Bio2hazard ,

What tool did you use to generate the tar file inside your gz file?

I inspected the internal tar file with the Hex Editor HxD. The magic and version metadata fields of the first 2 entries indicate the archive is in ustar format, but then it contains an unexpected GNU format entry with the TarType LongLink in the 3rd position (and then a 4th entry containing the actual data, because the LongLink entry is a metadata entry). Hence why you're getting the FormatException thrown: the reader detects that the archive is malformed for containing mixed format entries, and it is expected that the exception is thrown when attempting to read the 3rd one.

Here is what the spec has to say about the magic and version fields depending on the archive format:

For POSIX archives (ustar and pax)

     magic   Contains the magic	value "ustar" followed by a NUL	byte to	indi-
	     cate that this is a POSIX standard	archive.  Full compliance re-
	     quires the	uname and gname	fields be properly set.

     version
	     Version.  This should be "00" (two	copies of the ASCII digit
	     zero) for POSIX standard archives.

And then a few lines below it says:

     Field termination is specified slightly differently by POSIX than by pre-
     vious implementations.  The magic,	uname, and gname fields	must have a
     trailing NUL.

For GNU

 magic   The magic field holds the five characters "ustar" followed	by a
	     space.  Note that POSIX ustar archives have a trailing null.
version
	     The version field holds a space character followed	by a null.
	     Note that POSIX ustar archives use	two copies of the ASCII	digit
	     "0".

In other words:

For POSIX format archives (ustar or pax), the magic + version fields, which are one after the other, should look like: ustar\000.
In the GNU format, should look like: ustar \0.

I'm surprised that SharpCompress and 7-zip are able to open this malformed archive. This tells me that they are flexible when it comes to mixed format entries. The spec does not explicitly indicate the entries should all be in the same format, but I interpreted it this way considering there are metadata entries like L, K (GNU) or x, g (PAX) which precede other entries. Also, the Unix tar CLI tool does not mix format entries, and fails to write entries with long paths and long entries if you create an archive in v7 or ustar, because those formats do not support that.

Based on the above, I see one public API structure change we should make: If archives are allowed to have mixed entries as we are seeing in your example (or in other words, should not expect all entries to be of the same format), then we would have to remove the public Format property from TarReader, and we would have to move it to the base TarEntry class. That way, the user will have to check this property from each individual entry so they know to which format they should cast the returned entry.

Note: The TarWriter receives a Format in the constructor, and this is important, particularly when creating a PAX archive, because the user should be able to add global extended attributes which only show up at the beginning of the archive. So the constructor's format argument should stay. But if the user creates a V7 or a Ustar formatted archive with the writer, and suddenly decides to write an unsupported entry like a GnuTarEntry or a PaxTarEntry with a long path or a long link, we should still throw (current behavior), to prevent the user from intermixing formats, exactly like the Unix tar CLI tool does.

carlossanlop · 2022-05-19T07:17:58Z

@Bio2hazard I'm really curious how you generated your archive, because even if I address the above by allowing intermixing formats, the magic and version of your entries are malformed in the 3rd entry: they indicate ustar instead of in gnu format, even though the entry type is L which is exclusive to the gnu format. And once again, I'm very surprised that SharpCompress and 7-zip were kind enough to allow it 😄 .

Bio2hazard · 2022-05-19T20:23:25Z

@carlossanlop

Yikes, I was unaware that the archive is in a malformed format.

So the original file that I first encountered this problem with was created by TeamCity Enterprise. A build agent designated a folder as the output artifact, and TeamCity automatically turned the folder into a .tar.gz. I think these are the relevant docs in particular the target_archive part.

These are the relevant build logs:

[10:46:02] Publishing artifacts (13s)
[10:46:02] [Publishing artifacts] Collecting files to publish: [+:project/dist/dev=>project-dev-12345-67890-abcdefg.tar.gz]
[10:46:02] [Publishing artifacts] Creating archive project-dev-12345-67890-abcdefg.tar.gz (8s)
[10:46:02] [Creating archive project-dev-12345-67890-abcdefg.tar.gz] Creating /opt/jetbrains/TeamCity_agent/temp/buildTmp/TarPreprocessor1234567890123456789/project-dev-12345-67890-abcdefg.tar.gz

Of course, I can't exactly share our internal build artifacts, so what I did was:

Downloaded the build artifact onto my windows 10 machine (and ran into the reported errors when trying to benchmark)
Opened (not extracted) the .tar.gz archive with the 7z GUI
Navigated into the .tar file
Through the 7z GUI I modified the contents of the tar by adding the 3 test files, and deleting all the proprietary build artifacts
Then closed the 7z GUI window, which prompts 7z to ask if you want to persist the changes, which I accepted

My hope here was that modifying the file in this manner would cause 7z to use the same tar format/style as the original archive produced by TeamCity.

While I understand that that is a fairly ... unorthodox way of creating an archive, I am actually fairly confident it is a faithful reproduction. Debugging the actual TeamCity archive and the supplied test archive shows that they both fail in the same way with the same stack traces.

Furthermore, I had actually as a proof of concept created a very rudimentary manual tar extractor based on the GNU spec and it also got tripped up and failed at the long link part 😄

TeamCity is a big player in the CI/CD space, and the artifact publisher that produced this artifact is built-in and native. Given that I could not find any notable user discussions around problems with the artifact archives produced by TeamCity, I have to assume that it is very widely supported despite being malformed. Otherwise I am sure there would be plenty of discussions about it.

Please let me know if you need anything else. If you are concerned that the reproduction archive isn't accurate enough I can provide the header bytes of the original archive.

Edit: To further make sure the reproduction is accurate:

I took the original build artifact from teamcity
made a copy of it
opened the copy with the 7z GUI, removed a file and saved it
opened the copy with the 7z GUI a second time, re-added the file and saved it
compared all 512-byte tar headers between the original and the 7z modified copy: They all matched

carlossanlop · 2022-05-20T02:34:42Z

Thanks for the information, @Bio2hazard . A few more questions:

So the original file that I first encountered this problem with was created by TeamCity Enterprise

Do you happen to know what was the original format of the archive? I couldn't find it in the documentation you shared. The reason I ask is because you then mentioned:

Through the 7z GUI I modified the contents of the tar by adding the 3 test files, and deleting all the proprietary build artifacts

Which I interpret as: you deleted all the original files that the TeamCity tool generated, then added new files using 7-zip. So it's 7-zip who's adding the (seemingly) malformed files, not TeamCity.

Furthermore, I had actually as a proof of concept created a very rudimentary manual tar extractor based on the GNU spec and it also got tripped up and failed at the long link part

Do you mean you tried reading the entries using the new TarReader API, or did you use some other tools? If you used TarReader and it failed, it was most likely because of the malformed magic and version. I should make it more flexible and avoid failing if the magic is not well formed, see if that helps with your case.

Given that I could not find any notable user discussions around problems with the artifact archives produced by TeamCity, I have to assume that it is very widely supported despite being malformed.

I don't think the archive created with the TeamCity tool is generating malformed entries. I think it was 7-zip, since you did not preserve any of the original artifacts.

opened the copy with the 7z GUI, removed a file and saved it

opened the copy with the 7z GUI a second time, re-added the file and saved it

compared all 512-byte tar headers between the original and the 7z modified copy: They all matched

These 3 steps are confusing: Are you saying that some of the files in the .tar.gz file you shared are the original ones that were added by the TeamCity tool? If yes, which one was not removed?

If all the files were replaced using 7-zip, then 7-zip is making a bad job at detecting the format of existing entries, and adding entries in whatever format it wants.

Still, if you're saying 7-zip and SharpCompress both support intermixed format entries in a single archive, then we should support them as well. So I'll start working on that.

Bio2hazard · 2022-05-20T19:30:31Z

Sorry, you may be right. I'll try to clarify.

I created a LINQPad Script (gist) to read the headers of the files in the tar archive and compare the headers between 2 archives.

I was trying to determine if 7z changes the tar headers when files are added through the 7z GUI, as you are suspecting.

I extracted and untar'd the original archive as produced by teamcity, then made a copy of it and through the 7z GUI deleted, and then re-added a single file (from the extracted original archive). In that test, the header did not show up as changed by 7z, but I'm not convinced it was working right.

I performed the experiment again today, this time removing and re-adding every file, and now the headers do show up differently, so 7z is modifying the tar headers.

At a glance, I can tell that some differences are that user / group are removed and that the file modes are different.

So for comparison, here is the Base64 of a tar header as created directly by TeamCity:
YXNzZXRzLzAtNjA4ZWNhOWJiMWRlMjE4ZjFiZTUucG5nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA2NDQgMDAzNTIzMCAwMDM1MjMwIDAwMDAwMTMzMzEwIDE0MjI1MDY1MzYwIDAyMDM0MwAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMHRlYW1jaXR5AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAdGVhbWNpdHkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And the Base64 of the tar header for the same file as created by 7z:
YXNzZXRzLzAtNjA4ZWNhOWJiMWRlMjE4ZjFiZTUucG5nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMTMzMzEwADE0MjI1MDY1MzYwADAxMzIyMAAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

Here is the Base64 of a long link tar header as created directly by TeamCity:
Li8uL0BMb25nTGluawAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA2NDQgMDAwMDAwMCAwMDAwMDAwIDAwMDAwMDAwMTY0IDE0MjI1MDY1MzYwIDAxMTYzNwAgTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhciAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And the corresponding Base64 of a long link tar header as created by 7z:
Li8uL0BMb25nTGluawAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMDAwMTY0ADE0MjI1MDY1MzYwADAxMDEwNgAgTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

Finally, here is the Base64 of a long file name entry as created directly by TeamCity:
YXNzZXRzL2ZlYXR1cmVzLmNoYW5uZWwtaGVhZGVyLmNvbXBvbmVudHMuc29jaWFsLWJ1dHRvbnMuY29tcG9uZW50cy5mb2xsb3dlci1lbW90ZXMtcHJvbXB0LTRlNzZjZmYyNjAxMDA2NDQgMDAzNTIzMCAwMDM1MjMwIDAwMDAwMDEzMzQwIDE0MjI1MDY1MzYwIDAzNjMzMwAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMHRlYW1jaXR5AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAdGVhbWNpdHkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAwMDAwMDAwIDAwMDAwMDAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

And here is the corresponding Base64 of a long file name entry as created by 7z:
YXNzZXRzL2ZlYXR1cmVzLmNoYW5uZWwtaGVhZGVyLmNvbXBvbmVudHMuc29jaWFsLWJ1dHRvbnMuY29tcG9uZW50cy5mb2xsb3dlci1lbW90ZXMtcHJvbXB0LTRlNzZjZmYyADAxMDA3NzcAMDAwMDAwMAAwMDAwMDAwADAwMDAwMDEzMzQwADE0MjI1MDY1MzYwADAzMTEyMgAgMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB1c3RhcgAwMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

I think that should cover all cases as far as header differences goes.

It bears repeating that the original, completely unmodified, downloaded straight from teamcity .tar.gz archive failed to process with the new TarReader in the same manner as the test file I provided here. So even if the header of the teamcity entry is not malformed, it still failed in the same way.

Hope that helps and clears things up. I'm an SDE, not DevOps so I didn't have the option of producing a sharable/sanitized archive directly in teamcity.

carlossanlop · 2022-05-20T20:07:55Z

Thanks for spending time helping with this, @Bio2hazard. I appreciate it a lot.

I performed the experiment again today, this time removing and re-adding every file, and now the headers do show up differently, so 7z is modifying the tar headers.

Ah good to know, my suspicion was correct.

At a glance, I can tell that some differences are that user / group are removed and that the file modes are different.

This might be because you were doing this on a Windows machine. The concept of mode, uname and gname are from Unix. This is ok.

So even if the header of the teamcity entry is not malformed, it still failed in the same way.

This is interesting, and I think it's the only thing left for us to clarify, if that's ok. Do you mind sharing the exception message/callstack when using TarReader with the original TeamCity archive? I think the exception messages you shared in your previous comments came from the modified file, which contained entries that were added by 7-zip, not TeamCity. Feel free to obfuscate any sensitive information if needed.

Actually, if you have time and the chance to help a bit more: if you can download HxD Hex Editor and open the .tar file generated with TeamCity (not the .tar.gz file) you can see the raw bytes of the archive. The metadata is always located in the first 512 bytes of each entry. So for example, in the archive you shared that was modified with 7-zip, the magic and version fields of the first entry can be located in the blocks 257-264. If you scroll down, the magic and version fields of the LongLink entry are located in the blocks 9473-9480.

Can you share the magic and version fields of the LongLink entry in the TeamCity archive? This would help us determine if it's malformed or not.

Here's what I get from the first entry:

Here are the magic and version of the LongLink that 7-zip modified, they are malformed - Notice that it the value is ustar\000, which is the posix format, instead of ustar \0, which is the GNU format expected for a LongLink entry:

And again, thank you so much for testing this and for your patience.

Bio2hazard · 2022-05-20T20:13:10Z

@carlossanlop I'll get those stack traces for you

Can you share the magic and version fields of the LongLink entry in the TeamCity archive?

I wanted to quickly add that you should be able to just base64 decode the headers in my last message. They should have exactly what you are looking for. It's the first 512 bytes of the entries, from both teamcity and 7z.

carlossanlop · 2022-05-20T21:19:32Z

I wanted to quickly add that you should be able to just base64 decode the headers in my last message.

Oh cool, never done that. Allow me to figure it out then, and will come back to you if I don't find the info I wanted.

EDIT: this is AWESOME! https://www.base64decode.org/

Bio2hazard · 2022-05-20T22:04:57Z

@carlossanlop nice looks like it worked out! Base64 is just a quick and convenient way to share byte[]s through a text format. You can use Convert.FromBase64String("<base 64 header from before>") to get the raw byte[] in C# if that helps.

I got the stack traces for you:

WithSystemsFormatTar_PreDecompressed()

In this variant, the .tar.gz is first decompressed via GZipStream into a seekable MemoryStream, which is passed into TarReader

TeamCity Archive:

System.FormatException: Entry 'assets/features.channel-header.components.social-buttons.components.follower-emotes-prompt-4e76cff26' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_PreDecompressed|0_0(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 27
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Test Archive:

System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_PreDecompressed|0_0(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 27
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Stack Traces are identical

WithSystemsFormatTar_StreamedDecompression()

In this variant, the TarReader reads directly from the nonseekable GZipStream.
No exception gets thrown here, but only a single entry is returned. The stack trace represents the part where it failed when processing the second file.

TeamCity Archive:

TarHelpers.IsAllNullBytes()at TarHelpers.cs:line 106
TarHeader.TryReadCommonAttributes()at TarHeader.Read.cs:line 309
TarHeader.TryGetNextHeader()
TarReader.TryGetNextEntryHeader()
TarReader.GetNextEntry()
Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression|0_1()
Program.<Main>$()

Test Archive:

TarHelpers.IsAllNullBytes()at TarHelpers.cs:line 106
TarHeader.TryReadCommonAttributes()at TarHeader.Read.cs:line 309
TarHeader.TryGetNextHeader()
TarReader.TryGetNextEntryHeader()
TarReader.GetNextEntry()
Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression|0_1()
Program.<Main>$()

The stack traces are again identical. This is the relevant failure point in TryReadCommonAttributes

// Empty checksum means this is an invalid (all blank) entry, finish early
Span<byte> spanChecksum = buffer.Slice(FieldLocations.Checksum, FieldLengths.Checksum);
if (TarHelpers.IsAllNullBytes(spanChecksum))
{
    return false;
}

The debugger showed all bytes of the checksum being null, which bubbles up and causes GetNextEntry to return false.

WithSystemsFormatTar_StreamedDecompression_WithCopyData()

In this variant, the TarReader reads directly from the nonseekable GZipStream just like the previous one. The only difference is that this uses tarReader.GetNextEntry(true) when reading entries.

TeamCity Archive:

System.IO.EndOfStreamException: Attempted to read past the end of the stream.
   at System.Formats.Tar.TarHelpers.CopyBytes(Stream origin, Stream destination, Int64 bytesToCopy)
   at System.Formats.Tar.TarHeader.GetDataStream(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarHeader.ProcessDataBlock(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression_WithCopyData|0_2(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 81
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

Test Archive:

System.FormatException: Entry 'testing-a-verylong.filename-that-cant-fit-in-a-single.tar-header-and-will-probably-fail-to-load-cor' was expected to be in the GNU format, but did not have the expected version data.
   at System.Formats.Tar.TarHeader.ReadVersionAttribute(Span`1 buffer)
   at System.Formats.Tar.TarHeader.TryGetNextHeader(Stream archiveStream, Boolean copyData)
   at System.Formats.Tar.TarReader.TryProcessGnuMetadataHeader(TarHeader header, Boolean copyData, TarHeader& finalHeader)
   at System.Formats.Tar.TarReader.TryGetNextEntryHeader(TarHeader& header, Boolean copyData)
   at System.Formats.Tar.TarReader.GetNextEntry(Boolean copyData)
   at Program.<<Main>$>g__WithSystemsFormatTar_StreamedDecompression_WithCopyData|0_2(<>c__DisplayClass0_0&) in C:\Development\Experimentation\net7testing\Program.cs:line 81
   at Program.<Main>$(String[] args) in C:\Development\Experimentation\net7testing\Program.cs:line 8

This is the only test case where the result was different between the two. I hope that helps!

* Address some System.Formats.Tar TODOs (infra and syscalls) (#69107) Re-submitting the changes approved in PR: #69107 Addresses: #68230 Includes an extra change to prevent the android build failure with the addition of the new native call to getgrgid_r. Co-authored-by: carlossanlop <carlossanlop@users.noreply.github.com> Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 19, 2022

carlossanlop self-assigned this Apr 19, 2022

carlossanlop added the area-System.IO label Apr 19, 2022

carlossanlop removed the untriaged New issue has not been triaged by the area owner label Apr 19, 2022

carlossanlop mentioned this issue Apr 19, 2022

Implement Tar APIs #67883

Merged

carlossanlop added this to the 7.0.0 milestone May 10, 2022

carlossanlop mentioned this issue May 10, 2022

Address some System.Formats.Tar TODOs (infra and syscalls) #69107

Merged

carlossanlop mentioned this issue May 18, 2022

Resumbmit tar feedback changes, include Android build fix #69469

Merged

carlossanlop mentioned this issue May 19, 2022

[API Proposal]: TarReader should not assume same format for all entries in a tar file #69544

Closed

carlossanlop mentioned this issue May 24, 2022

How to properly set umask in TarEntry and ZipFileExtensions #69760

Closed

carlossanlop mentioned this issue Jun 9, 2022

Add Tar support to PowerShell with the new .NET 7.0 APIs PowerShell/Microsoft.PowerShell.Archive#116

Closed

iSazonov mentioned this issue Aug 5, 2022

Tar crashes on reading the specific file #73440

Closed

adamsitnik modified the milestones: 7.0.0, 8.0.0 Aug 12, 2022

jeffhandley added the area-System.Formats.Tar label Nov 21, 2022

jeffhandley removed the area-System.IO label Nov 21, 2022

carlossanlop modified the milestones: 8.0.0, 9.0.0 Jul 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tar APIs pending feedback to address #68230

Tar APIs pending feedback to address #68230

carlossanlop commented Apr 19, 2022 •

edited

Loading

dotnet-issue-labeler bot commented Apr 19, 2022

ghost commented Apr 19, 2022

Bio2hazard commented May 17, 2022

carlossanlop commented May 17, 2022

Bio2hazard commented May 17, 2022

carlossanlop commented May 19, 2022 •

edited

Loading

carlossanlop commented May 19, 2022

Bio2hazard commented May 19, 2022 •

edited

Loading

carlossanlop commented May 20, 2022

Bio2hazard commented May 20, 2022 •

edited

Loading

carlossanlop commented May 20, 2022

Bio2hazard commented May 20, 2022

carlossanlop commented May 20, 2022 •

edited

Loading

Bio2hazard commented May 20, 2022

Tar APIs pending feedback to address #68230

Tar APIs pending feedback to address #68230

Comments

carlossanlop commented Apr 19, 2022 • edited Loading

dotnet-issue-labeler bot commented Apr 19, 2022

ghost commented Apr 19, 2022

Bio2hazard commented May 17, 2022

carlossanlop commented May 17, 2022

Bio2hazard commented May 17, 2022

carlossanlop commented May 19, 2022 • edited Loading

For POSIX archives (ustar and pax)

For GNU

carlossanlop commented May 19, 2022

Bio2hazard commented May 19, 2022 • edited Loading

carlossanlop commented May 20, 2022

Bio2hazard commented May 20, 2022 • edited Loading

carlossanlop commented May 20, 2022

Bio2hazard commented May 20, 2022

carlossanlop commented May 20, 2022 • edited Loading

Bio2hazard commented May 20, 2022

WithSystemsFormatTar_PreDecompressed()

WithSystemsFormatTar_StreamedDecompression()

WithSystemsFormatTar_StreamedDecompression_WithCopyData()

carlossanlop commented Apr 19, 2022 •

edited

Loading

carlossanlop commented May 19, 2022 •

edited

Loading

Bio2hazard commented May 19, 2022 •

edited

Loading

Bio2hazard commented May 20, 2022 •

edited

Loading

carlossanlop commented May 20, 2022 •

edited

Loading