Support for tar archives #3253

Open
bogdanteleaga opened this Issue Sep 15, 2015 · 9 comments

bogdanteleaga commented Sep 15, 2015

Right now corefx supports zip files as well as gz files. Would it be hard to get it to support tar files as well, for compatibility with other OSes, which very often package files as tgz?

A C# implementation already exists at https://code.google.com/p/tar-cs/ and could be used either as a guideline or directly imported.

If this is something that is desired I could work on designing an API, but it shouldn't be hard to visualize what it might look like.

Member

ianhays commented Dec 17, 2015

> Right now corefx supports zip files as well as gz files. Would it be hard to get it to support tar files as well, for compatibility with other OSes, which very often package files as tgz?

We don't have any existing tar code to leverage, if that's what you mean. Tar is a pretty different format from Zip (particularly since it doesn't compress) so we would need to start mostly from scratch. That's not to say it isn't worthwhile, though. I'd love to be able to handle all popular compression formats.

I'm not sure, though, where we would even want to put this if we did add it. System.IO.Compression only kind of makes sense, since tar doesn't compress. I guess FileSystem? Still, tar is so frequently associated with gzip that it seems incorrect not to place it alongside it.

> If this is something that is desired I could work on designing an API, but it shouldn't be hard to visualize what it might look like.

In my opinion it would be ideal for it to be as similar to ZipArchive as possible.

Member

ianhays commented Jun 27, 2016

@ericstj @jasonwilliams200OK

ghost commented Jun 27, 2016

Thanks @ianhays.

From #9673:

> Also .bz2 if possible (http://www.bzip.org/1.0.3/html/zlib-compat.html). :)

Usually a .bz2 file is the compressed container for a tarball, since bzip2 only compresses a single file (https://en.wikipedia.org/wiki/Bzip2). We can probably use the same API methods to support bz2 (except for some format-specific settings).

This way, tarball expansion / contraction might make sense in System.IO.Compression (S.I.C) as part of bz2 (or even zip) compression / decompression.

Member

ianhays commented Jul 5, 2016

Proposal

With .NET running cross-platform, it would be nice to be able to use it with the compression/archive formats common to systems other than Windows. Tar is an archival format popularly used on Unix alongside some sort of single-file compression, and we don't currently have a way of dealing with it. I suggest we add an API for Tar that is similar to ZipArchive, as well as some extension methods similar to ZipFile. I'll focus on the former in this issue.

Formats

Tar has been around for a while. A long while. As a result, there are a few different accepted formats that aren't all compatible with each other. Most programs will detect the format of a tar and deal with it accordingly when de-archiving, but they usually will only archive in one format. It's reasonable for us to do the same, though an expansion point in the future would be to allow archiving in multiple formats for potential compat reasons. GNU tar, for example, allows archiving in multiple formats but defaults to the GNU tar format (though this is supposed to change in a future version).

  • Tar V7
    • Tar V7 is the first standard Tar, shipped with V7 of AT&T Unix. Its header occupies one 512-byte block, of which V7 uses the first 257 bytes, and all subsequent tar formats use those bytes in the same way, so V7 is compatible pretty much everywhere. Support for it is easy, though file names are limited to 100 bytes.
  • UStar Tar
    • An expansion of V7 that is accepted most places. It adds a 155-byte prefix field to the 100-byte name field, which boosts file name length to roughly 256 bytes. The canonical Tar.
  • PAX/Posix Tar
    • Pax tar is an expansion of UStar that uses the UStar header but adds extra entries to accommodate metadata that won't fit in the UStar header. This makes Pax backwards-compatible with UStar, though a reader that only understands UStar will extract those metadata entries as additional files. Because metadata is no longer limited to one header block, file names are no longer bounded in length, and permissions may be fully represented.
  • GNU Tar
    • GNU tar is a bit of a mess. There are several different definitions, all of which conflict. However, as it is the current default format for the GNU 'tar' command, we should support de-archiving it.
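
As a rough illustration of how the formats above could be told apart, here is a minimal sketch that inspects the eight bytes following the V7 fields in a header block. The names `TarFormat` and `TarFormatDetector` are hypothetical and not part of the proposal; the offsets (type flag at 156, magic at 257) are the ones used by the UStar layout. PAX cannot be recognized from the magic alone, so the sketch only flags individual PAX metadata entries.

```csharp
using System;
using System.Text;

// Hypothetical names for illustration; not part of the proposed API.
public enum TarFormat { V7, UStar, Gnu }

public static class TarFormatDetector
{
    private const int BlockSize = 512;      // every header occupies one 512-byte block
    private const int TypeFlagOffset = 156; // single-byte entry type
    private const int MagicOffset = 257;    // first byte after the V7 fields

    // Classifies one header block by the 8 bytes (magic + version) at offset 257.
    public static TarFormat Detect(byte[] header)
    {
        if (header == null || header.Length < BlockSize)
            throw new ArgumentException("A tar header block is 512 bytes.", nameof(header));

        string magic = Encoding.ASCII.GetString(header, MagicOffset, 8);
        if (magic == "ustar\0" + "00") return TarFormat.UStar; // "ustar", NUL, version "00"
        if (magic == "ustar  \0") return TarFormat.Gnu;        // "ustar", two spaces, NUL
        return TarFormat.V7;                                   // no magic present
    }

    // PAX reuses the UStar magic; its metadata entries carry type flag 'x' or 'g'.
    public static bool IsPaxExtendedHeader(byte[] header)
    {
        return Detect(header) == TarFormat.UStar
            && (header[TypeFlagOffset] == (byte)'x' || header[TypeFlagOffset] == (byte)'g');
    }
}
```

A real implementation would probably also validate the header checksum before trusting any of these bytes; that step is left out here to keep the sketch short.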

Basic API

To start, the API should mirror ZipArchive/ZipArchiveEntry.

public class TarArchive : IDisposable {
    public TarArchive(Stream stream);
    public TarArchive(Stream stream, bool leaveOpen);
    public ReadOnlyCollection<TarArchiveEntry> Entries { get; }
    public TarArchiveEntry CreateEntry(string entryName);
    public void Dispose();
    public TarArchiveEntry GetEntry(string entryName);
}
public class TarArchiveEntry {
    public TarArchiveEntry();
    public TarArchive Archive { get; }
    public string FullName { get; }
    public DateTimeOffset LastWriteTime { get; set; }
    public long Length { get; }
    public string Name { get; }
    public void Delete();
    public Stream Open();
}

The above API works with every format. We could also consider adding additional TarArchiveEntry properties for expanded metadata available only in newer formats, or we could make subclasses e.g. UStarTarArchiveEntry. I'd say that's more of an expansion point than something we should do straight away, however.

Implementation

Tar doesn't have a central directory for entries like Zip does; entries are placed sequentially in the archive with no indexing. This makes finding a particular entry time-consuming, but it also means that implementing the format is comparatively simple. The lack of compression of individual files also greatly simplifies implementing a TarArchive class. The difficulty primarily comes from the different formats requiring very different measures for the more complicated features, e.g. sparse files in GNU or metadata files in PAX. There are also some more specific rules for edge cases like duplicate entries that we'll need robust tests to validate.
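
To make the "no central directory" point concrete, here is a minimal sketch of what looking up a single entry involves on a seekable stream: walk the headers one by one, skipping each entry's data, until the name matches. `TarScan` and `FindEntryData` are hypothetical names. The sketch assumes the V7 field layout only (name at offset 0, octal size at offset 124) and ignores checksums, the UStar prefix field, and PAX/GNU long names.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of the proposed API.
public static class TarScan
{
    private const int BlockSize = 512;

    // Linear scan: returns the offset of the data for `name`, or -1 if absent.
    public static long FindEntryData(Stream archive, string name, out long size)
    {
        var header = new byte[BlockSize];
        archive.Position = 0;
        while (ReadBlock(archive, header))
        {
            if (header[0] == 0) break; // empty name: treat as the end-of-archive marker
            size = ReadOctal(header, 124, 12);
            if (ReadString(header, 0, 100) == name) return archive.Position;

            // Entry data is padded to a whole number of 512-byte blocks.
            long padded = (size + BlockSize - 1) / BlockSize * BlockSize;
            archive.Seek(padded, SeekOrigin.Current);
        }
        size = 0;
        return -1;
    }

    // Text fields are NUL-terminated ASCII unless they fill the whole field.
    private static string ReadString(byte[] block, int offset, int length)
    {
        int end = Array.IndexOf(block, (byte)0, offset, length);
        int count = (end < 0 ? offset + length : end) - offset;
        return Encoding.ASCII.GetString(block, offset, count);
    }

    // Numeric fields are octal ASCII digits padded with spaces or NULs.
    private static long ReadOctal(byte[] block, int offset, int length)
    {
        string text = ReadString(block, offset, length).Trim();
        return text.Length == 0 ? 0 : Convert.ToInt64(text, 8);
    }

    // Reads exactly one block; returns false if the stream ends first.
    private static bool ReadBlock(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) return false;
            total += read;
        }
        return true;
    }
}
```

The cost of a lookup grows with the number of entries that precede the target, which is the trade-off described above.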

My plan of attack is to split the work into chunks with each chunk being "up for grabs":

  • Add tar test data in multiple formats
  • Add TarArchive/TarArchiveEntry with V7 and UStar support and format header detection. Throw an error for unsupported formats (e.g. GNU, PAX)
  • Add tar archival/de-archival support for GNU tar
  • Add tar archival/de-archival support for PAX/Posix tar
  • Add TarFile shortcuts

jstarks commented Aug 24, 2016

@ianhays I recently implemented a set of classes for manipulating tar files, including ustar and PAX (but not GNU, IIRC) support. We needed this for our Docker PowerShell cmdlets. They might be a good starting point for this work.

https://github.com/Microsoft/Docker-PowerShell/tree/master/src/Tar

jstarks commented Aug 24, 2016

I should also note that tar archives are fundamentally different from zip archives in that they are stream-oriented and do not contain a central directory of files. This means that both TarArchive.GetEntry and TarArchiveEntry.Open are unnatural: to implement either, you have to require a seekable stream, or you have to buffer the contents of the entire archive into memory or a temporary file (which obviously is uncompetitive from a performance perspective). And it's not realistic to require a seekable stream, since you'll want to support decompressing and extracting tar.gz files in one pass, and decompressors such as GZipStream are not seekable.

The reality is that tar demands a very different interface from zip. You may want to look at my implementation for some ideas of what works better with tar.
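
For illustration only (this is not the code from the linked repository), a forward-only reader along these lines might look like the sketch below. It never seeks and only calls `Stream.Read`, so it works on a non-seekable source such as a `GZipStream`. `ForwardOnlyTar` and `ExtractAll` are hypothetical names, and the sketch assumes the basic header layout only (name at offset 0, octal size at offset 124), with no handling of type flags, the UStar prefix, or PAX/GNU extensions.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sketch; not an existing or proposed type.
public static class ForwardOnlyTar
{
    private const int BlockSize = 512;

    // Visits every entry in a single pass. `openTarget` returns the stream that
    // should receive the entry's bytes, or null to skip the entry. The caller
    // owns (and disposes) the returned streams. Returns the number of entries.
    public static int ExtractAll(Stream source, Func<string, Stream> openTarget)
    {
        var block = new byte[BlockSize];
        int count = 0;
        while (Fill(source, block) && block[0] != 0) // empty name: end-of-archive marker
        {
            string name = ReadString(block, 0, 100);
            string sizeText = ReadString(block, 124, 12).Trim();
            long remaining = sizeText.Length == 0 ? 0 : Convert.ToInt64(sizeText, 8);

            Stream target = openTarget(name);
            while (remaining > 0)
            {
                // Data is stored in whole 512-byte blocks; the last one is zero-padded.
                if (!Fill(source, block)) throw new EndOfStreamException("Truncated tar entry.");
                int useful = (int)Math.Min(BlockSize, remaining);
                target?.Write(block, 0, useful);
                remaining -= useful;
            }
            count++;
        }
        return count;
    }

    // Text fields are NUL-terminated ASCII unless they fill the whole field.
    private static string ReadString(byte[] block, int offset, int length)
    {
        int end = Array.IndexOf(block, (byte)0, offset, length);
        int count = (end < 0 ? offset + length : end) - offset;
        return Encoding.ASCII.GetString(block, offset, count);
    }

    // Reads exactly one block, tolerating short reads; false if the stream ends first.
    private static bool Fill(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) return false;
            total += read;
        }
        return true;
    }
}
```

With this shape, a tar.gz can be processed by wrapping the file stream in `new GZipStream(file, CompressionMode.Decompress)` and passing that to `ExtractAll`; skipped entries are simply read and discarded, which is why nothing like `GetEntry` fits naturally.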

Member

ianhays commented Aug 25, 2016

> And it's not realistic to require a seekable stream, since you'll want to support decompressing and extracting tar.gz files in one pass, and decompressors such as GZipStream are not seekable.

In my above comment (under "Implementation") I was operating under the tentative plan that Entry indexing would require a seekable stream and throw an exception otherwise, but as you said, a seekable stream is unlikely to be available often, since the tar will nearly always be wrapped in a GZip or LZMA stream. It may be worth adding anyways to cover edge cases, but I doubt it. Enumerating entries would be the preferred way of reading the archive.

> The reality is that tar demands a very different interface from zip. You may want to look at my implementation for some ideas of what works better with tar.

The nice thing about not having a common parent with ZipArchive is that we can diverge the interface where it's necessary. While it would be ideal to have the API be similar, it isn't required. That said, I think we can at least keep the TarArchive/TarArchiveEntry structure if we just make some tweaks.

> I recently implemented a set of classes for manipulating tar files, including ustar and PAX (but not GNU, IIRC) support. We needed this for our Docker PowerShell cmdlets. They might be a good starting point for this work.

Thanks @jstarks, that looks very close to what I had in mind with the exception of some minor API differences (e.g. a unified TarArchive class rather than a TarReader/TarWriter, IEnumerable Entries, disposable TarArchive) and of course the removal of indexing entries. Regardless of the structure though, the implementation looks good and could be easily adapted into the code base with some minor tweaks.

karelz removed the suggestion label Sep 18, 2016

ianhays removed their assignment Oct 10, 2016

Member

danmosemsft commented Dec 1, 2017

Per discussion with @ianhays, a conservative estimate for all this is 5 weeks; if it were forward-only, it would take less time.

ravarnamsft commented Apr 6, 2018

+1 on requesting this support on .NET Core. This would be super helpful for any cloud service that's trying to untar developer-uploaded files from a Linux machine. We are one of them (we use .NET Core). We would really prefer to avoid picking third-party libraries or resorting to running a shell process to achieve this. This becomes more crucial because file permissions on zip archives are not set correctly for files archived on Linux. The fact that neither tar nor zip is fully functional out of the box on Linux makes it hard to support the Linux platform on our service in a clean way.
