Add a Forward-only API for System.IO.Compression #1550

Open · adamhathcock opened this issue Jun 1, 2017 · 37 comments

Labels: api-suggestion (Early API idea and discussion, it is NOT ready for implementation), area-System.IO.Compression
Milestone: Future

@adamhathcock

This was started in https://github.com/dotnet/corefx/issues/9657

There seems to be a growing desire/need for a forward-only API that accesses compressed file formats (e.g. zip, gzip, tar, etc.) in a streaming manner. This means very large files, as well as streams like network streams, can be read and decompressed on the fly. Basically, the API reads from Stream objects and never seeks on them. This is how the Reader/Writer API from SharpCompress works.

Here's a sample from the unit tests:

// ForwardOnlyStream is a test helper that wraps the file stream and refuses to seek.
using (Stream stream = new ForwardOnlyStream(File.OpenRead(path)))
using (IReader reader = ReaderFactory.Open(stream))
{
    // Each entry is visited exactly once, in archive order.
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory)
        {
            // SCRATCH_FILES_PATH is the unit test's output directory.
            reader.WriteEntryToDirectory(test.SCRATCH_FILES_PATH, new ExtractionOptions()
            {
                ExtractFullPath = true,
                Overwrite = true
            });
        }
    }
}

public interface IReader : IDisposable
{
    event EventHandler<ReaderExtractionEventArgs<IEntry>> EntryExtractionProgress;

    event EventHandler<CompressedBytesReadEventArgs> CompressedBytesRead;
    event EventHandler<FilePartExtractionBeginEventArgs> FilePartExtractionBegin;

    ArchiveType ArchiveType { get; }

    IEntry Entry { get; }

    /// <summary>
    /// Decompresses the current entry to the stream.  This cannot be called twice for the current entry.
    /// </summary>
    /// <param name="writableStream"></param>
    void WriteEntryTo(Stream writableStream);

    bool Cancelled { get; }
    void Cancel();

    /// <summary>
    /// Moves to the next entry by reading more data from the underlying stream.  If the current entry's data has not been read, it is skipped.
    /// </summary>
    /// <returns></returns>
    bool MoveToNextEntry();

    /// <summary>
    /// Opens the current entry as a stream that will decompress as it is read.
    /// Read the entire stream or use SkipEntry on EntryStream.
    /// </summary>
    EntryStream OpenEntryStream();
}

WriteEntryToDirectory is an extension method that provides some shortcuts for dealing with file operations, but what it really does is grab the internal stream and decompress it. The actual entry method is just IReader.WriteEntryTo(Stream). If the entry isn't extracted, the internal stream is simply moved forward without decompressing where possible (some formats require decompression, since the compressed length can be unknown).
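
For illustration, a minimal sketch of how such an extension could wrap WriteEntryTo (the path handling here is an assumption for illustration, not SharpCompress's actual implementation):

using System.IO;

public static class ReaderDirectoryExtensions
{
    // Hypothetical helper: resolves the entry key against a target directory
    // and streams the decompressed bytes into a file via WriteEntryTo.
    public static void WriteEntryToDirectory(this IReader reader, string destinationDirectory)
    {
        string destination = Path.Combine(destinationDirectory, reader.Entry.Key);
        Directory.CreateDirectory(Path.GetDirectoryName(destination));
        using (FileStream output = File.Create(destination))
        {
            reader.WriteEntryTo(output); // decompresses the current entry, forward-only
        }
    }
}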

The Writer API works similarly.

There is also a generic API from ReaderFactory or WriterFactory that doesn't require knowledge of the format beforehand. There is also a similar ArchiveFactory that is writeable (that uses WriterFactory internally to save) that could also be used for the current ZipArchive API and beyond.

As the author of SharpCompress, I'd like to push a lot of the ideas into the core library, but having native access to the compression streams (like the internal zlib) would be a great performance benefit for me. I haven't ever written any compression algorithm implementations myself, so I'm sure my managed implementations need a lot of love.

I would start by creating ZipReader and ZipWriter (as well as starting the generic API) using a lot of the internal code already in the core library to prove out the API. This kind of relates to https://github.com/dotnet/corefx/issues/14853, but forward-only access is something most libraries don't account for, so I'm not sure. Other easy additions would be Reader/Writer support for GZip, BZip2 and LZip (with an LZMA compressor).

Tar support linked with the previous single-file formats would be a larger addition. SharpCompress deals with tar.gz, tar.bz2 and tar.lz auto-magically by first detecting the compression format and then the tar file inside. The above API works the same.
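
As a hedged sketch of that layered detection (the magic-byte check is standard gzip; the Seek is a simplification, since a truly forward-only reader would buffer the peeked bytes instead of rewinding):

using System.IO;
using System.IO.Compression;

static Stream UnwrapGZipIfPresent(Stream stream)
{
    // Every gzip member begins with the magic bytes 0x1F 0x8B.
    int b0 = stream.ReadByte();
    int b1 = stream.ReadByte();
    stream.Seek(0, SeekOrigin.Begin); // simplification: assumes a seekable source
    return (b0 == 0x1F && b1 == 0x8B)
        ? new GZipStream(stream, CompressionMode.Decompress)
        : stream;
}
// A tar reader would then consume the returned stream and look for the tar header inside.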

Thoughts?

Summoning a few people from the other issue
cc: @ianhays @karelz @qmfrederik

@karelz
Member

karelz commented Jun 12, 2017

@ianhays what is the next step? You did +1 above; does that mean you like the idea? Do we need an API proposal? Or is it ready for API review? (Disclaimer: I just skimmed through it)

@adamhathcock
Author

I can put together a new API based on what I've done. It would be very similar to SharpCompress.

@svick
Contributor

svick commented Jun 12, 2017

  1. Is a new writer API necessary? AFAIK, ZipArchive already supports writing to non-seekable streams on .NET Core (https://github.com/dotnet/corefx/issues/11497).
  2. For reading, I think the API should return an IEnumerable<T>. That way, the user can use a foreach and doesn't have to manually call something like MoveNext in a while loop. (Some LINQ methods also become a possibility.)

@adamhathcock
Author

  1. You're right, this probably isn't strictly necessary. It would be the inverse of the proposed Reader API and would have a generic writing implementation, while the current one is specific to Zip and probably wouldn't apply in a generic case. If a generic API isn't desired then this could possibly be dropped.
  2. I've gone back and forth on whether I wanted to do it that way. The reason I haven't is that people could be tempted to call ToList() on the reader, which would definitely not work and would defeat the point of the reader API when they should have used the archive API.

@svick
Contributor

svick commented Jun 12, 2017

Then maybe the API should return a type that can be used in a foreach, but does not implement IEnumerable<T>. Though that could be confusing to someone who sees the API for the first time.
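
For reference, a minimal sketch of that pattern: C#'s foreach binds to any public GetEnumerator() by convention, so a type can support foreach without implementing IEnumerable<T>, which also keeps LINQ's ToList() from compiling against it. (The types here are illustrative.)

public sealed class EntryCollection
{
    // foreach only needs a public GetEnumerator() whose result exposes
    // MoveNext() and Current; no interface is required.
    public Enumerator GetEnumerator() => new Enumerator();

    public struct Enumerator
    {
        public bool MoveNext() => false; // placeholder body for the sketch
        public object Current => null;
    }
}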

@ianhays
Contributor

ianhays commented Jun 12, 2017

If a generic API isn't desired then this could possibly be dropped.

I've gone back-and-forth about this myself. I opened https://github.com/dotnet/corefx/issues/9709 to track my thoughts long-term on a way to centralize our compression algorithms somehow. It's clear that as we continue adding more compression types to corefx, it's going to become trickier and trickier to integrate them smoothly and expose them to devs of all skill levels without being overwhelming or overly simplistic.

I've gone back and forth on whether I wanted to do it that way. The reason I haven't is that people could be tempted to call ToList() on the reader, which would definitely not work and would defeat the point of the reader API when they should have used the archive API.

I think in this case increased usability is more valuable than protecting devs from themselves. You could implement a code analysis rule specifically to look for instances where devs do a ToList() and suggest they avoid it.

@ianhays
Contributor

ianhays commented Jun 12, 2017

https://github.com/dotnet/corefx/issues/9237 is also somewhat related, though less so.

@ianhays
Contributor

ianhays commented Jun 12, 2017

@ianhays what is the next step? You did +1 above; does that mean you like the idea? Do we need an API proposal? Or is it ready for API review? (Disclaimer: I just skimmed through it)

I do like the idea. My main concern is that I really want to do our compression expansion right. There are a ton of questions around design and usability that I want answered well before we take anything like this issue, dotnet/corefx#9237, or dotnet/corefx#9709. Adding something like LZMA is easy because we can model it after DeflateStream/GZipStream with a similar API, but this issue (and the others linked above) is far more complex in the long-term implications of its design and API expansion.

@karelz
Member

karelz commented Jun 12, 2017

OK, what is the next step? (Are you coming back with a larger design?) What is a rough ETA?

@ianhays
Contributor

ianhays commented Jun 12, 2017

OK, what is the next step? (Are you coming back with a larger design?)

Continue discussing design choices in this issue, dotnet/corefx#9709, and dotnet/corefx#9237. Wait for other people to be interested and comment/vote on these threads with their opinion.

What is a rough ETA?

Whenever we reach quorum and have a sizable mass of users requesting the feature. It's not something that we're hurting for right now or that a lot of customers are asking for, so I don't feel comfortable assigning an ETA for it.

@adamhathcock
Author

adamhathcock commented Jun 13, 2017

Here's a rough Reader API based on what I've done with SharpCompress; making the Reader itself be IEnumerable might not quite be correct.

//The file format of the Archive.  Can be confusing, as GZip is a single-file format as well as a compression type (used with Tar as the archive type).
public enum ArchiveType
{
    Rar,
    Zip,
    Tar,
    SevenZip,
    GZip
}

//The compression type used on the Archive/Entry.  GZip, BZip2, LZip and Xz are single-file formats as well as compression types.
public enum CompressionType
{
    None,
    GZip,
    BZip2,
    PPMd,
    Deflate,
    Rar,
    LZMA,
    BCJ,
    BCJ2,
    LZip,
    Xz,
    Unknown
}

//possibly not needed
public class ReaderOptions
{
    //Password protected archives (out of scope?)
    public string Password { get; set; }

    //Take ownership of the stream
    public bool CloseStream { get; set; }
}

public static class ReaderFactory
{
    public static IReader Open(Stream stream, ReaderOptions options = null);
#if NETSTANDARD1_3
    public static IReader Open(FileInfo file, ReaderOptions options = null);
    public static IReader Open(string filePath, ReaderOptions options = null);
#endif
}

public interface IReader : IEnumerable<IReaderEntry>, IDisposable
{
    ArchiveType ArchiveType { get; }

    IReaderEntry CurrentEntry { get; }
}

//Extends the shared metadata interface below with forward-only data access
public interface IReaderEntry : IEntry
{
    Stream OpenRead();
}

public interface IEntry
{
    //This is on entry because some archive types (like ZIP) can have compression types different per entry
    CompressionType CompressionType { get; }
    string Key { get; }

    //everything below here is optional depending on archive type
    DateTime? ArchivedTime { get; }
    
    long? CompressedSize { get; }
    long? Crc { get; }
    long? Size { get; }

    DateTime? CreatedTime { get; }
    
    bool? IsDirectory { get; }
    bool? IsEncrypted { get; }
    bool? IsSplit { get; }
    
    DateTime? LastAccessedTime { get; }
    DateTime? LastModifiedTime { get; }
}

public static class IReaderExtensions
{
#if NETSTANDARD1_3
    public static void WriteEntryTo(this IReaderEntry reader, string filePath);
    /// <summary>
    /// Extract all remaining unread entries to specific directory, retaining filename
    /// </summary>
    public static void WriteAllToDirectory(this IReader reader, string destinationDirectory,
                                            ExtractionOptions options = null);

    /// <summary>
    /// Extract to specific directory, retaining filename
    /// </summary>
    public static void WriteEntryToDirectory(this IReaderEntry reader, string destinationDirectory,
                                                ExtractionOptions options = null);

    /// <summary>
    /// Extract to specific file
    /// </summary>
    public static void WriteEntryToFile(this IReaderEntry reader, string destinationFileName,
                                        ExtractionOptions options = null);
#endif
}

[Flags]
public enum ExtractionOptions
{
    None = 0,

    /// <summary>
    /// overwrite target if it exists
    /// </summary>
    Overwrite = 1,

    /// <summary>
    /// extract with internal directory structure
    /// </summary>
    ExtractFullPath = 2,

    /// <summary>
    /// preserve file time
    /// </summary>
    PreserveFileTime = 4,

    /// <summary>
    /// preserve windows file attributes
    /// </summary>
    PreserveAttributes = 8,
}

Sample usage

public static void Main() 
{
    using(var reader = ReaderFactory.Open("C:\\test.zip"))
    {
        foreach(var entry in reader)
        {
            if (!entry.IsDirectory)
            {
                entry.WriteEntryToDirectory("D:\\temp", ExtractionOptions.ExtractFullPath);
            }
        }
    }

    using(var reader = ReaderFactory.Open("C:\\test.zip"))
    {
        reader.WriteAllToDirectory("D:\\temp2", ExtractionOptions.ExtractFullPath);
    }

    //no explicit using statement for IReader
    foreach(var entry in ReaderFactory.Open("C:\\test.zip"))
    {
        if (entry.Key.StartsWith("x"))
        {
            using(var data = entry.OpenRead()) 
            {
                data.CopyTo(new MemoryStream());
            }
        }
    }
}

Can expand out to show the generic Archive API too.

@svick
Contributor

svick commented Jun 13, 2017

@ianhays

My main concern is that I really want to do our compression expansion right. There are a ton of questions around design and usability that I want answered well before we take anything like this issue, dotnet/corefx#9237, or dotnet/corefx#9709.

Would it then make sense to design (and implement) an initial version of the whole compression expansion in corefxlab? That way, all the related APIs can be designed together and easily inform each other.

@adamhathcock
Author

Putting it wholesale together as a real thing in corefxlab makes sense. Evolving the API might be too slow.

@ianhays
Contributor

ianhays commented Jun 13, 2017

Would it then make sense to design (and implement) an initial version of the whole compression expansion in corefxlab? That way, all the related APIs can be designed together and easily inform each other.

That is probably the best idea. We can design the large structure in corefxlab, then move small pieces to corefx one-by-one with their own justification so that we can actually get them through API review.

Here's a rough Reader API based on what I've done with SharpCompress; making the Reader itself be IEnumerable might not quite be correct.

I like that too. Your rough API is pretty similar to what I had in mind.

A couple suggestions:

  • ReaderOptions should include an option to specify the ArchiveType explicitly.
  • ArchiveType.GZip should be called TarGZip or GZipTar to avoid confusion.
  • ExtractionOptions should have a PreservePermissions entry.
  • Should there be some relation between IReaderEntry and IEntry? Where is the latter used?
  • Instead of having IReader.WriteEntryTo and friends, I think those should be extension methods on the IReaderEntry (or IEntry) itself.
  • All of the instances of Reader/IReader should be replaced with ArchiveReader or IArchiveReader (or something like that).

On a more general note, could you explain why you split reading and writing into their own interfaces/classes? Having a single IArchive seems more intuitive and usable despite opening up some unsupported operations, e.g. attempting to write to an archive that is being read.

@ianhays
Contributor

ianhays commented Jun 13, 2017

public interface IEntry

It should be possible to combine this with the concept of a compression algorithm interface a la https://github.com/dotnet/corefx/issues/9709.

@adamhathcock
Author

ReaderOptions should include an option to specify the ArchiveType explicitly.

Makes sense to explicitly try an ArchiveType.

ArchiveType.GZip should be called TarGZip or GZipTar to avoid confusion.

That's a way to go. Explicitly identify Tar archives with a matching single file compression.

ExtractionOptions should have a PreservePermissions entry.

Can do.

Should there be some relation between IReaderEntry and IEntry? Where is the latter used?

IEntry is used with IArchive. Just want to reuse the object among both the forward-only API and random access API.

Instead of having IReader.WriteEntryTo and friends, I think those should be extension methods on the IReaderEntry (or IEntry) itself.

Only one of the extensions is for IReader. The others are for IReaderEntry. I guess it's not explicit as they're all in the same extension class.

All of the instances of Reader/IReader should be replaced with ArchiveReader or IArchiveReader (or something like that).

I disagree. Archive is random access (like the current ZipArchive is random access). The forward-only requires a different way of thinking. This goes back to using IEnumerable or not. Archive is a "nicer" API. However, for perf concerns, you should learn a specialized forward-only API (Reader). There should be a matching ArchiveFactory that is random access and can be read/write. This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

It should be possible to combine this with the concept of a compression algorithm interface a la dotnet/corefx#9709.

I don't think so. You could probably put DeflateStream, LZMAStream, etc. behind a CompressionStream. IEntry is about reading the metadata of an entry in an archive, and then choosing to read/extract the data or skip it.
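
A hedged sketch of that separation: a CompressionStream base type (illustrative, not an actual corefx API) would abstract over the algorithm streams, while IEntry stays purely about archive metadata.

using System.IO;

public abstract class CompressionStream : Stream
{
    // Identifies the algorithm the concrete stream (DeflateStream, an LZMA
    // stream, etc.) implements; all data access stays plain Stream I/O.
    public abstract CompressionType CompressionType { get; }
}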

@ianhays
Contributor

ianhays commented Jun 14, 2017

That's a way to go. Explicitly identify Tar archives with a matching single file compression.

Yeah. It's not ideal though.

IEntry is used with IArchive. Just want to reuse the object among both the forward-only API and random access API.
Only one of the extensions is for IReader. The others are for IReaderEntry. I guess it's not explicit as they're all in the same extension class.

I see, thanks.

This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

To me it seems like a lot of distinct interfaces and classes to protect the programmer from themselves. I'd personally much rather use one interface/class like how ZipArchive can either read or write depending on the stream passed to it.

@adamhathcock
Author

Yeah. It's not ideal though.

Then we don't need it! Can already match based on headers.

To me it seems like a lot of distinct interfaces and classes to protect the programmer from themselves.

It's not about protecting, it's about guiding. People often say they like the SharpCompress API better than others because it's more explicit. I'm definitely against combining the forward-only and random access scenarios. The use-cases need to be handled in very different ways, and forcing the APIs together feels unnatural: knowing which methods/properties can be used, and when, puts a lot of cognitive burden on the user. Keeping them separate doesn't reduce performance, and it results in simpler code.

I'd personally much rather use one interface/class like how ZipArchive can either read or write depending on the stream passed to it.

The Archive API does do read/writes in SharpCompress. The Reader/Writer API has a clean break as these actions in a forward-only scenario are very different.

I'm just offloading my experience with SharpCompress. I'm not saying there isn't a better way than the direction I've gone, but I don't want the same mistakes repeated.

@qmfrederik
Contributor

I disagree. Archive is random access (like the current ZipArchive is random access). The forward-only requires a different way of thinking. This goes back to using IEnumerable or not. Archive is a "nicer" API. However, for perf concerns, you should learn a specialized forward-only API (Reader). There should be a matching ArchiveFactory that is random access and can be read/write. This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

Some archive types are designed to be read forward-only (such as CPIO or tar archives). On the other hand, zip archives have a central directory at the end of the file, so you can have a very efficient random access API for zip archives, too.

@adamhathcock
Author

adamhathcock commented Jun 15, 2017

Some archive types are designed to be read forward-only (such as CPIO or tar archives). On the other hand, zip archives have a central directory at the end of the file, so you can have a very efficient random access API for zip archives, too.

Trust me, I know. I've been dealing with and creating SharpCompress for almost a decade.

For random access APIs (Archive): If there is no central directory, then you scan the file. If the sizes of the entries are known, then you can just seek to headers and it's not a big deal.

Zip is the only format that really does have a central directory. 7Zip has funky stuff (and I hate it) and everything else is basically <header><data><header><data> (RAR, Tar, etc.), so they need to be scanned.

My point is that for very large files, in certain cases, you want a forward-only API for performance reasons. Everything except 7Zip and Zip64 (when writing) can be done this way. Zip has post-data descriptors when writing out (though Zip64 doesn't support this). Tar can be written out as a forward-only archive if you know the data size for the header.
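
For illustration, a hedged sketch of forward-only writing in that style; the WriterFactory/IWriter shapes mirror the Reader proposal above and are assumptions here, not a settled API:

using System;
using System.IO;

public static class ForwardOnlyWriteSample
{
    public static void WriteTar()
    {
        using (Stream output = File.Create("out.tar"))
        using (IWriter writer = WriterFactory.Open(output, ArchiveType.Tar, CompressionType.None))
        using (Stream source = File.OpenRead("data.bin"))
        {
            // Tar headers carry the entry size up front, so the source's length
            // must be known before its bytes stream through, forward-only.
            writer.Write("data.bin", source, DateTime.UtcNow);
        }
    }
}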

@ianhays
Contributor

ianhays commented Jun 16, 2017

Then we don't need it! Can already match based on headers.

I was more thinking about having it for the Writer.

It's not about protecting, it's about guiding. People often say they like the SharpCompress API better than others because it's more explicit. I'm definitely against combining the forward-only and random access scenarios. The use-cases need to be handled in very different ways, and forcing the APIs together feels unnatural: knowing which methods/properties can be used, and when, puts a lot of cognitive burden on the user. Keeping them separate doesn't reduce performance, and it results in simpler code.

I definitely don't disagree with you. I personally have always disliked having the same class for writing/reading because it's more difficult to reason about. I don't even like the words 'Write'/'Read' being used in DeflateStream/GZipStream, but they're necessary to implement Stream.

I mostly just want to make sure we're avoiding unnecessary duplication and/or APIs with low usage due to redundancy.

You've convinced me that we should at least have an IForwardArchive and an IArchive.

If I was designing this from scratch I would probably aim to have something like (very rough):

    public interface IForwardArchive : IEnumerable<IArchiveEntry>
    {
        // Reader functions
        IEnumerator<IArchiveEntry> GetEnumerator();
        IArchiveEntry CurrentEntry { get; }

        // Writer functions
        void Write(string filename, Stream source, DateTime? modificationTime);

        // Shared functions
        ArchiveType Type { get; }
        void Dispose();
    }

    public interface IArchive : IForwardArchive
    {
        IEnumerable<IArchiveEntry> Entries { get; }
        IEnumerable<IVolume> Volumes { get; }

        bool IsSolid { get; }
        bool IsComplete { get; }
        long TotalSize { get; }
        long TotalUncompressSize { get; }
    }

@adamhathcock
Author

I was more thinking about having it for the Writer.

Yes, with a writer you have to specify an archive type.

I mostly just want to make sure we're avoiding unnecessary duplication and/or APIs with low usage due to redundancy.

You've convinced me that we should at least have an IForwardArchive and an IArchive.

If I was designing this from scratch I would probably aim to have something like (very rough):

I don't think that layout quite works. You shouldn't have CurrentEntry on IArchive; it doesn't have the concept of a current entry because it's random access. There is very little shared between IArchive and IForwardArchive, really. The concepts just don't match much beyond the fact that you're dealing with entries: one case is only an enumerable, the other is a collection.

You also want to remove Write from the forward-only reading functions. For a forward-only writer, you want to explicitly create an entry each time and pass a byte stream to the writer, so that a potentially large stream or network stream is handled one entry at a time. Again, mixing writing and reading in a forward-only context just creates confusion in the API.

I don't even like the words 'Write'/'Read' being used in DeflateStream/GZipStream, but they're necessary to implement Stream.

I don't like Read/Write being on Stream at the same time either. One thing the Java API got correct was separating them :) Having to always check CanWrite or CanRead can be burdensome.

@am11
Member

am11 commented Jul 8, 2017

Should this proposal be considered in conjunction with the upcoming System.IO.Pipelines: https://github.com/dotnet/corefxlab/blob/091d12e/docs/specs/pipelines-io.md?

There seem to be compression APIs in place based on pipelines as well, under the System.IO.Pipelines.Compression namespace: https://github.com/dotnet/corefxlab/tree/091d12e/src/System.IO.Pipelines.Compression

@karelz
Member

karelz commented Jul 9, 2017

It is currently unclear if System.IO.Pipelines will make it into CoreFX or not. There are still active discussions around that. It may just sit on top of CoreFX.

If there are Compression investments in the Pipelines space, I would hope they are considered to be done in lower layers (System.IO.Compression) first, so that more people/apps can benefit from the investments. @joshfree @KrzysztofCwalina can you please comment?

@joshfree
Member

If there are Compression investments in the Pipelines space, I would hope they are considered to be done in lower layers (System.IO.Compression) first, so that more people/apps can benefit from the investments.

Yes, any actual compression investments are being done at a lower level. For instance, one of our summer interns is working on System.IO.Compression.Brotli for a future release of .NET Core @ https://github.com/dotnet/corefxlab/tree/master/src/System.IO.Compression.Brotli.

@twsouthwick
Member

Users of DocumentFormat.OpenXml (a common Office document manipulation library) have hit a few issues in System.IO.Packaging that require streaming support in zip files. I'm not familiar enough with the domain to add much to the API discussion, but would love to see some movement on this issue.

@ghost

ghost commented Nov 16, 2017

I want to add my voice agreeing with @twsouthwick - this is critically needed for the OpenXML library.

thanks - dave

@deinok

deinok commented Feb 26, 2019

In the case of *.tar.* files, why add direct support for the compression cases?
I mean, it feels more natural to, for example, call the constructor with a GZipStream.

@danmoseley
Member

Moved to Future. This is not a must-fix for 3.0.

@carlossanlop
Member

carlossanlop commented Dec 16, 2019

Triage:
Complex feature and an entirely different API model. Significant investment. There seems to be interest in it, including from partners like NuGet and WPF. It could probably be implemented on top of DeflateStream. The API design could be similar to the one required for TAR (#1540).

@adamhathcock
Author

Agreed with the API similar to Tar!

The problem with Zip is that the format is streamable, but not all Zips are created with the required header info to make them stream-readable. I just throw exceptions in SharpCompress in this case.

The current implementation of Zip requires using the central directory at the end.

GZipStream is also a problem, because the current implementation doesn't expose the file headers/metadata needed to use it properly as an archive format. (Related to dotnet/corefx#7570)

@scalablecory
Contributor

I have a library that parses XLSX files which would have made good use of this.

@markusschaber

I like the general idea, despite the problems with the ZIP format. In most environments, one may be able to constrain oneself to compliant ZIPs, and it's very useful for TAR, too.

However, some of the methods might need Async variants, I guess, especially WriteEntryTo, which might potentially take a very long time. :-)

@carlossanlop carlossanlop transferred this issue from dotnet/corefx Jan 9, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.IO.Compression untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@carlossanlop carlossanlop added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@carlossanlop carlossanlop added this to the Future milestone Jan 9, 2020
@iUnknwn

iUnknwn commented Jun 2, 2020

Wanted to add a bit more information on this topic - I am one of the authors of Microsoft SimpleRemote, which we use for testing in Surface, and we ended up using a third-party tar library because there was no way to do forward-only operations with what's included in the current Compression namespace.

This is less of an API defect and more of a problem with the zip format itself - Zip is easy to stream when writing, because it stores the metadata for the files at the end of the archive, so you can write to the stream without ever seeking. Reading, however, is incredibly difficult, because you need to read the end of the file before operating on the archive.

To address this, if you try to open a non-seekable stream with ZipArchive, .NET internally reads the entire stream into a memory stream before processing (this works in some cases, but does not work on zip files larger than 4GB, as MemoryStream does not support streams larger than Int32.MaxValue bytes).
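
A simplified illustration of that buffering behavior (not the actual ZipArchive internals):

using System.IO;
using System.IO.Compression;

static ZipArchive OpenNonSeekable(Stream networkStream)
{
    // The entire payload must arrive before the central directory at the end
    // can be read, and MemoryStream caps out at Int32.MaxValue bytes.
    var buffer = new MemoryStream();
    networkStream.CopyTo(buffer);
    buffer.Position = 0;
    return new ZipArchive(buffer, ZipArchiveMode.Read);
}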

Here's an example of where this is a problem: Let's say you're reading a zip stream over a network - this means you'd need to download the entire file before being able to access any of the contents. If we compare this to tar (or cpio, or any tar-like format), we don't have this restriction - we can freely read elements as they come in, since all metadata for a given file is just before that file.

This is a problem for any case where the source stream does not support seeking (if we're encrypting data using CryptoStream, for example, it also isn't seekable). Adding tar, or another archive format that supports forward-only reads and writes, would make these scenarios much more straightforward and performant.

One note, though - if we implement tar, please use something that, at minimum, supports USTAR (and, ideally, supports the 2001 PAX extensions, so we can write files of unlimited size - historic tar has an 8GB-per-entry cap).
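
(As a quick check on where that cap comes from: the historic ustar header stores the entry size as 11 octal digits.)

// 11 octal digits give a maximum of 8^11 - 1 = 2^33 - 1 bytes, i.e. about 8 GiB.
long maxUstarEntrySize = (1L << 33) - 1; // 8,589,934,591 bytes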

@adamhathcock
Author

SharpCompress does what you need but the Tar support isn’t super great. Would love help expanding it.

@iUnknwn

iUnknwn commented Jun 9, 2020

@adamhathcock I'll take a look! From a quick glance at the code, it looks like SharpCompress Tar entries are USTAR + GNU extensions, correct? It should be pretty easy to add basic support for PAX extensions for long-file, long-link, and large sizes.

@bjh0

bjh0 commented Sep 6, 2021

Is there an update on this issue? It affects my use of the OpenXML SDK.
