
Add a Forward-only API for System.IO.Compression #20558

Open
adamhathcock opened this Issue Jun 1, 2017 · 27 comments

adamhathcock commented Jun 1, 2017

This was started in #9657

There seems to be a growing desire/need for a forward-only API that accesses compressed file formats (e.g. zip, gzip, tar, etc.) in a streaming manner. This means very large files, as well as streams like network streams, can be read and decompressed on the fly. Basically, the API reads from Stream objects and never seeks on them. This is how the Reader/Writer API from SharpCompress works.

Here's a sample from the unit tests:

using (Stream stream = new ForwardOnlyStream(File.OpenRead(path)))
using (IReader reader = ReaderFactory.Open(stream))
{
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory)
        {
            reader.WriteEntryToDirectory(test.SCRATCH_FILES_PATH, new ExtractionOptions()
            {
                ExtractFullPath = true,
                Overwrite = true
            });
        }
    }
}

public interface IReader : IDisposable
{
    event EventHandler<ReaderExtractionEventArgs<IEntry>> EntryExtractionProgress;

    event EventHandler<CompressedBytesReadEventArgs> CompressedBytesRead;
    event EventHandler<FilePartExtractionBeginEventArgs> FilePartExtractionBegin;

    ArchiveType ArchiveType { get; }

    IEntry Entry { get; }

    /// <summary>
    /// Decompresses the current entry to the stream.  This cannot be called twice for the current entry.
    /// </summary>
    /// <param name="writableStream"></param>
    void WriteEntryTo(Stream writableStream);

    bool Cancelled { get; }
    void Cancel();

    /// <summary>
    /// Moves to the next entry by reading more data from the underlying stream.  This skips if data has not been read.
    /// </summary>
    /// <returns></returns>
    bool MoveToNextEntry();

    /// <summary>
    /// Opens the current entry as a stream that will decompress as it is read.
    /// Read the entire stream or use SkipEntry on EntryStream.
    /// </summary>
    EntryStream OpenEntryStream();
}

WriteEntryToDirectory is an extension method that provides some shortcuts for dealing with file operations, but what it really does is grab the internal stream and decompress it. The actual entry method is just IReader.WriteEntryTo(Stream). If the entry isn't extracted, the internal stream is just moved forward without decompressing where possible (some formats require decompression anyway, since the compressed length can be unknown).
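That skip-versus-decompress distinction can be sketched roughly as follows; `SkipHelper` and its shape are invented for illustration, not the actual SharpCompress implementation:

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hedged sketch of skipping an entry in a forward-only reader.
// SkipHelper/SkipEntry are illustrative names only.
static class SkipHelper
{
    public static void SkipEntry(Stream source, long? compressedSize)
    {
        if (compressedSize is long size)
        {
            // The header recorded the entry's raw length, so advance past it
            // without inflating anything.
            var buffer = new byte[8192];
            while (size > 0)
            {
                int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, size));
                if (read <= 0) throw new EndOfStreamException();
                size -= read;
            }
        }
        else
        {
            // e.g. a zip entry written with a data descriptor (general purpose
            // flag bit 3): the compressed length is only known after the data,
            // so the entry must be decompressed and discarded to reach the
            // next header.
            using (var inflater = new DeflateStream(source, CompressionMode.Decompress, leaveOpen: true))
                inflater.CopyTo(Stream.Null);
        }
    }
}
```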

The Writer API works similarly.

There is also a generic API, via ReaderFactory or WriterFactory, that doesn't require knowledge of the format beforehand, as well as a similar ArchiveFactory that is writeable (it uses WriterFactory internally to save) and could also back the current ZipArchive API and beyond.

As the author of SharpCompress, I'd like to push a lot of these ideas into the core library, but having native access to the compression streams (like the internal zlib) would be a great performance benefit for me. I haven't ever written any compression algorithm implementations myself, so I'm sure my managed implementations need a lot of love.

I would start by creating ZipReader and ZipWriter (as well as starting the generic API) using a lot of the internal code already in the core library to prove out the API. This kind of relates to #14853, but forward-only access is something most libraries don't account for, so I'm not sure. Other easy additions would be Reader/Writer support for GZip, BZip2 and LZip (with an LZMA compressor).

Tar support linked with the previous single-file formats would be a larger addition. SharpCompress deals with tar.gz, tar.bz2 and tar.lz auto-magically by detecting first the compressed format, then a tar file inside. The above API works the same.
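That auto-detection can be sketched as magic-byte sniffing. The magic numbers below are the real ones (gzip `1F 8B`, bzip2 `"BZh"`, lzip `"LZIP"`, zip `"PK"`), but `FormatSniffer` itself is only an illustration, not the proposed API:

```csharp
// Illustrative outer-format detection by magic bytes. After opening the
// matching decompressor, the same sniffing can run again on the decompressed
// bytes to find a tar archive inside (tar carries "ustar" at offset 257
// rather than a magic prefix, so it is detected separately).
static class FormatSniffer
{
    public static string DetectOuterFormat(byte[] header)
    {
        if (StartsWith(header, 0x1F, 0x8B)) return "gzip";
        if (StartsWith(header, (byte)'B', (byte)'Z', (byte)'h')) return "bzip2";
        if (StartsWith(header, (byte)'L', (byte)'Z', (byte)'I', (byte)'P')) return "lzip";
        if (StartsWith(header, (byte)'P', (byte)'K')) return "zip";
        return "unknown";
    }

    private static bool StartsWith(byte[] data, params byte[] magic)
    {
        if (data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
            if (data[i] != magic[i]) return false;
        return true;
    }
}
```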

Thoughts?

Summoning a few people from the other issue
cc: @ianhays @karelz @qmfrederik

karelz commented Jun 12, 2017

@ianhays what is the next step? You did +1 above, does it mean you like the idea? Do we need API proposal? Or is it ready for API review? (Disclaimer: I just skimmed through it)

adamhathcock commented Jun 12, 2017

I can put together a new API based on what I've done. It would be very similar to SharpCompress.

svick commented Jun 12, 2017

  1. Is a new writer API necessary? AFAIK, ZipArchive already supports writing to non-seekable streams on .Net Core (#11497).
  2. For reading, I think the API should return an IEnumerable<T>. That way, the user can use a foreach and they don't have to manually call something like MoveNext in a while loop. (Some LINQ methods also become a possibility.)

adamhathcock commented Jun 12, 2017

  1. You're right, this probably isn't strictly necessary. It would be the inverse of a proposed Reader API and it would have a generic writing implementation while the current one is specific to Zip and could probably not apply in a generic case. If a generic API isn't desired then this could possibly be dropped.
  2. I've gone back and forth if I wanted to do it that way. The reason why I haven't is because people could be tempted to do ToList() on the reader which would definitely not work and defeat the point of the reader API when they should have used the archive API.
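The hazard is concrete: a streaming reader typically reuses per-entry state, so materializing the sequence hands back stale objects. A minimal illustration (the `Entry` type and `ReadEntries` are invented for this sketch, not SharpCompress code):

```csharp
using System.Collections.Generic;
using System.Linq;

class Entry { public string Key; }

static class StreamingDemo
{
    // A forward-only reader often reuses one entry object (or its underlying
    // buffer) for each MoveNext, since earlier entries can no longer be read.
    public static IEnumerable<Entry> ReadEntries()
    {
        var current = new Entry();
        foreach (var key in new[] { "a.txt", "b.txt" })
        {
            current.Key = key;
            yield return current;
        }
    }
}

// StreamingDemo.ReadEntries().ToList() returns two references to the same
// object, both reporting Key == "b.txt" — exactly the trap described above.
```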

svick commented Jun 12, 2017

Then maybe the API should return a type that can be used in a foreach, but does not implement IEnumerable<T>. Though that could be confusing to someone who sees the API for the first time.
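For what it's worth, C#'s foreach is pattern-based: it binds to any public GetEnumerator() whose return type exposes Current and MoveNext(), so a type can be foreach-able without implementing IEnumerable&lt;T&gt; — and LINQ's ToList() then doesn't even compile against it. A minimal sketch, with an invented EntrySequence type:

```csharp
// Foreach-able without IEnumerable<T>: the compiler binds to the
// GetEnumerator/MoveNext/Current pattern, so LINQ extension methods
// like ToList() are simply unavailable on this type.
public class EntrySequence
{
    private readonly string[] _keys;
    public EntrySequence(params string[] keys) { _keys = keys; }

    public Enumerator GetEnumerator() { return new Enumerator(_keys); }

    public struct Enumerator
    {
        private readonly string[] _keys;
        private int _index;
        internal Enumerator(string[] keys) { _keys = keys; _index = -1; }
        public string Current { get { return _keys[_index]; } }
        public bool MoveNext() { return ++_index < _keys.Length; }
    }
}
```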

ianhays commented Jun 12, 2017

If a generic API isn't desired then this could possibly be dropped.

I've gone back-and-forth about this myself. I opened #9709 to track my thoughts long-term on a way to centralize our compression algorithms somehow. It's clear that as we continue adding more compression types to corefx it's going to become trickier and trickier to integrate them smoothly and expose them to devs of all skill levels without being overwhelming or overly simplistic.

I've gone back and forth if I wanted to do it that way. The reason why I haven't is because people could be tempted to do ToList() on the reader which would definitely not work and defeat the point of the reader API when they should have used the archive API.

I think in this case increased usability is more valuable than protecting devs from themselves. You could implement a code analysis rule specifically to look for instances where devs do a ToList() and suggest they avoid it.
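Such a rule could look roughly like the Roslyn sketch below; the diagnostic ID and all descriptor text are invented, and a real rule would also check the receiver's type symbol rather than just the method name:

```csharp
using System.Collections.Immutable;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;

// Rough sketch of an analyzer flagging ToList() on a forward-only reader.
// "COMP001" and the messages are illustrative only.
[DiagnosticAnalyzer(LanguageNames.CSharp)]
public sealed class ReaderToListAnalyzer : DiagnosticAnalyzer
{
    private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
        id: "COMP001",
        title: "Avoid materializing a forward-only reader",
        messageFormat: "Calling ToList() on a forward-only reader defeats streaming; use the archive API instead",
        category: "Usage",
        defaultSeverity: DiagnosticSeverity.Warning,
        isEnabledByDefault: true);

    public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics
        => ImmutableArray.Create(Rule);

    public override void Initialize(AnalysisContext context)
    {
        context.RegisterSyntaxNodeAction(AnalyzeInvocation, SyntaxKind.InvocationExpression);
    }

    private static void AnalyzeInvocation(SyntaxNodeAnalysisContext context)
    {
        var invocation = (InvocationExpressionSyntax)context.Node;
        if (invocation.Expression is MemberAccessExpressionSyntax member
            && member.Name.Identifier.Text == "ToList")
        {
            // A production rule would verify here that the receiver is the
            // forward-only reader type before reporting.
            context.ReportDiagnostic(Diagnostic.Create(Rule, member.Name.GetLocation()));
        }
    }
}
```

(This requires the Microsoft.CodeAnalysis.CSharp.Workspaces packages to build.)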

ianhays commented Jun 12, 2017

#9237 is also somewhat related, though less so.

ianhays commented Jun 12, 2017

@ianhays what is the next step? You did +1 above, does it mean you like the idea? Do we need API proposal? Or is it ready for API review? (Disclaimer: I just skimmed through it)

I do like the idea. My main concern is that I really want to do our compression expansion right. There are a ton of questions around design and usability that I want answered well before we take anything like this issue, #9237, or #9709. Adding something like LZMA is easy because we can model it after DeflateStream/GZipStream with a similar API, but this issue (and the others linked above) is far more complex in the long-term implications of its design and API expansion.

karelz commented Jun 12, 2017

OK, what is the next step? (you coming back with larger design?) What is rough ETA?

ianhays commented Jun 12, 2017

OK, what is the next step? (you coming back with larger design?)

Continue discussing design choices in this issue, #9709, and #9237. Wait for other people to be interested and comment/vote on these threads with their opinion.

What is rough ETA?

Whenever we reach quorum and have a sizable mass of users requesting the feature. It's not something that we're hurting for right now or that a lot of customers are asking for, so I don't feel comfortable assigning an ETA for it.

adamhathcock commented Jun 13, 2017

Here's a rough Reader API based on what I've done with SharpCompress; making the Reader itself IEnumerable might not be quite correct.

//The file format of the Archive.  Can be confusing as GZip is a single file format as well as a compression type (using Tar as archive type).
public enum ArchiveType
{
    Rar,
    Zip,
    Tar,
    SevenZip,
    GZip
}

//The compression type used on the Archive/Entry.  GZip,BZip2,LZip,Xz are single file formats as well as a compression type.
public enum CompressionType
{
    None,
    GZip,
    BZip2,
    PPMd,
    Deflate,
    Rar,
    LZMA,
    BCJ,
    BCJ2,
    LZip,
    Xz,
    Unknown
}

//possibly not needed
public class ReaderOptions
{
    //Password protected archives (out of scope?)
    public string Password { get; set; }

    //Take ownership of the stream
    public bool CloseStream { get; set; }
}

public static class ReaderFactory
{
    public static IReader Open(Stream stream, ReaderOptions options = null);
#if NETSTANDARD1_3
    public static IReader Open(FileInfo file, ReaderOptions options = null);
    public static IReader Open(string filePath, ReaderOptions options = null);
#endif
}

public interface IReader : IEnumerable<IReaderEntry>, IDisposable
{
    ArchiveType ArchiveType { get; }

    IReaderEntry CurrentEntry { get; }
}

public interface IReaderEntry : IEntry
{
    Stream OpenRead();
}

public interface IEntry
{
    //This is on entry because some archive types (like ZIP) can have compression types different per entry
    CompressionType CompressionType { get; }
    string Key { get; }

    //everything below here is optional depending on archive type
    DateTime? ArchivedTime { get; }
    
    long? CompressedSize { get; }
    long? Crc { get; }
    long? Size { get; }

    DateTime? CreatedTime { get; }
    
    bool? IsDirectory { get; }
    bool? IsEncrypted { get; }
    bool? IsSplit { get; }
    
    DateTime? LastAccessedTime { get; }
    DateTime? LastModifiedTime { get; }
}

public static class IReaderExtensions
{
#if NETSTANDARD1_3
    public static void WriteEntryTo(this IReaderEntry entry, string filePath);

    /// <summary>
    /// Extract all remaining unread entries to a specific directory, retaining filenames
    /// </summary>
    public static void WriteAllToDirectory(this IReader reader, string destinationDirectory,
                                           ExtractionOptions options = ExtractionOptions.None);

    /// <summary>
    /// Extract to a specific directory, retaining the filename
    /// </summary>
    public static void WriteEntryToDirectory(this IReaderEntry entry, string destinationDirectory,
                                             ExtractionOptions options = ExtractionOptions.None);

    /// <summary>
    /// Extract to a specific file
    /// </summary>
    public static void WriteEntryToFile(this IReaderEntry entry, string destinationFileName,
                                        ExtractionOptions options = ExtractionOptions.None);
#endif
}
[Flags]
public enum ExtractionOptions
{
    None = 0,

    /// <summary>
    /// overwrite target if it exists
    /// </summary>
    Overwrite = 1 << 0,

    /// <summary>
    /// extract with internal directory structure
    /// </summary>
    ExtractFullPath = 1 << 1,

    /// <summary>
    /// preserve file time
    /// </summary>
    PreserveFileTime = 1 << 2,

    /// <summary>
    /// preserve windows file attributes
    /// </summary>
    PreserveAttributes = 1 << 3,
}

Sample usage

public static void Main() 
{
    using(var reader = ReaderFactory.Open("C:\\test.zip"))
    {
        foreach(var entry in reader)
        {
            if (entry.IsDirectory != true) // IsDirectory is a bool?
            {
                entry.WriteEntryToDirectory("D:\\temp", ExtractionOptions.ExtractFullPath);
            }
        }
    }

    using(var reader = ReaderFactory.Open("C:\\test.zip"))
    {
        reader.WriteAllToDirectory("D:\\temp2", ExtractionOptions.ExtractFullPath);
    }

    // no explicit using statement for IReader
    foreach (var entry in ReaderFactory.Open("C:\\test.zip"))
    {
        if (entry.Key.StartsWith("x"))
        {
            using(var data = entry.OpenRead()) 
            {
                data.CopyTo(new MemoryStream());
            }
        }
    }
}

Can expand out to show the generic Archive API too.

svick commented Jun 13, 2017

@ianhays

My main concern is that I really want to do our compression expansion right. There are a ton of questions around design and usability that I want answered well before we take anything like this issue, #9237, or #9709.

Would it then make sense to design (and implement) an initial version of the whole compression expansion in corefxlab? That way, all the related APIs can be designed together and easily inform each other.

adamhathcock commented Jun 13, 2017

Putting it wholesale together as a real thing in corefxlab makes sense. Evolving the API might be too slow.

ianhays commented Jun 13, 2017

Would it then make sense to design (and implement) an initial version of the whole compression expansion in corefxlab? That way, all the related APIs can be designed together and easily inform each other.

That is probably the best idea. We can design the large structure in corefxlab, then move small pieces to corefx one-by-one with their own justification so that we can actually get them through API review.

Here's a rough Reader API based on what I've done with SharpCompress; making the Reader itself IEnumerable might not be quite correct.

I like that too. Your rough API is pretty similar to what I had in mind.

A couple suggestions:

  • ReaderOptions should include an option to specify the ArchiveType explicitly.
  • ArchiveType.GZip should be called TarGZip or GZipTar to avoid confusion.
  • ExtractionOptions should have a PreservePermissions entry.
  • Should there be some relation between IReaderEntry and IEntry? Where is the latter used?
  • Instead of having IReader.WriteEntryTo and friends, I think those should be extension methods on the IReaderEntry (or IEntry) itself.
  • All of the instances of Reader/IReader should be replaced with ArchiveReader or IArchiveReader (or something like that).

On a more general note, could you explain why you split reading and writing into their own interfaces/classes? Having a single IArchive seems more intuitive and usable, despite opening up some unsupported operations, e.g. attempting to write to an archive that is being read.

ianhays commented Jun 13, 2017

public interface IEntry

It should be possible to combine this with the concept of a compression algorithm interface a la #9709.

adamhathcock commented Jun 13, 2017

ReaderOptions should include an option to specify the ArchiveType explicitly.

Makes sense to explicitly try an ArchiveType.

ArchiveType.GZip should be called TarGZip or GZipTar to avoid confusion.

That's a way to go. Explicitly identify Tar archives with a matching single file compression.

ExtractionOptions should have a PreservePermissions entry.

Can do.

Should there be some relation between IReaderEntry and IEntry? Where is the latter used?

IEntry is used with IArchive. Just want to reuse the object among both the forward-only API and random access API.

Instead of having IReader.WriteEntryTo and friends, I think those should be extension methods on the IReaderEntry (or IEntry) itself.

Only one of the extensions is for IReader. The others are for IReaderEntry. I guess it's not explicit as they're all in the same extension class.

All of the instances of Reader/IReader should be replaced with ArchiveReader or IArchiveReader (or something like that).

I disagree. Archive is random access (like the current ZipArchive is random access). The forward-only requires a different way of thinking. This goes back to using IEnumerable or not. Archive is a "nicer" API. However, for perf concerns, you should learn a specialized forward-only API (Reader). There should be a matching ArchiveFactory that is random access and can be read/write. This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

It should be possible to combine this with the concept of a compression algorithm interface a la #9709.

I don't think so. You could probably put DeflateStream, LZMAStream, etc behind CompressionStream. IEntry is about reading the metadata of an entry in an archive. Then choosing to read/extract the data or skip it.
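A hypothetical usage contrast may make the split clearer; ArchiveFactory, Entries, and WriteEntryToFile here are assumed proposal names, and none of this compiles against an existing library:

```csharp
// Hypothetical contrast between the two APIs, assuming the proposed types.
void Extract(Stream seekableFile, Stream networkStream)
{
    // Random access: needs a seekable stream; entries can be visited in any
    // order, any number of times.
    using (var archive = ArchiveFactory.Open(seekableFile))
    {
        var readme = archive.Entries.First(e => e.Key == "readme.txt");
        readme.WriteEntryToFile("readme.txt");
    }

    // Forward-only: works on a network stream; one pass, in archive order,
    // and each entry must be consumed or skipped before moving on.
    using (var reader = ReaderFactory.Open(networkStream))
    {
        foreach (var entry in reader)
        {
            // decide now: OpenRead() and consume, or fall through to skip
        }
    }
}
```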

ianhays commented Jun 14, 2017

That's a way to go. Explicitly identify Tar archives with a matching single file compression.

Yeah. It's not ideal though.

IEntry is used with IArchive. Just want to reuse the object among both the forward-only API and random access API.
Only one of the extensions is for IReader. The others are for IReaderEntry. I guess it's not explicit as they're all in the same extension class.

I see, thanks.

This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

To me it seems like a lot of distinct interfaces and classes to protect the programmer from themselves. I'd personally much rather use one interface/class like how ZipArchive can either read or write depending on the stream passed to it.

adamhathcock commented Jun 15, 2017

> Yeah. It's not ideal though.

Then we don't need it! Can already match based on headers.

> To me it seems like a lot of distinct interfaces and classes to protect the programmer from themselves.

It's not about protecting, it's about guiding. People often say they like the SharpCompress API better than others because it's more explicit. I'm definitely against combining the forward-only and random access scenarios. The use cases need to be handled in very different ways, and forcing the APIs together feels unnatural: knowing which methods/properties can be used, and when, puts a lot of cognitive burden on the user. Keeping them separate doesn't cost performance, and it results in simpler code.

> I'd personally much rather use one interface/class like how ZipArchive can either read or write depending on the stream passed to it.

The Archive API does do read/writes in SharpCompress. The Reader/Writer API has a clean break because these actions in a forward-only scenario are very different.

I'm just offloading my experience with SharpCompress. I'm not saying there isn't a better way than the direction I've gone, but I don't want the same mistakes repeated.
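The clean break described here can be sketched against the SharpCompress types already shown in the issue description (ReaderFactory, IReader) alongside the random-access side (ArchiveFactory, IArchive). This is illustrative only; namespaces and members are those of recent SharpCompress releases:

```csharp
using System.IO;
using SharpCompress.Archives;   // random-access side
using SharpCompress.Readers;    // forward-only side

static class ReaderVsArchive
{
    // Forward-only: one pass over a stream that is never Seek()ed.
    // Entries must be read (or skipped) in order, exactly once.
    public static void ForwardOnly(Stream nonSeekable)
    {
        using (IReader reader = ReaderFactory.Open(nonSeekable))
        {
            while (reader.MoveToNextEntry())
            {
                if (reader.Entry.IsDirectory) continue;
                using (Stream entry = reader.OpenEntryStream())
                {
                    // must be consumed before the next MoveToNextEntry
                    entry.CopyTo(Stream.Null);
                }
            }
        }
    }

    // Random access: Entries is a collection; any order, any entry,
    // repeatedly. Requires a seekable stream (or a file path, as here).
    public static void RandomAccess(string path)
    {
        using (IArchive archive = ArchiveFactory.Open(path))
        {
            foreach (IArchiveEntry entry in archive.Entries)
            {
                if (!entry.Key.EndsWith(".txt")) continue;
                using (Stream s = entry.OpenEntryStream())
                {
                    s.CopyTo(Stream.Null);
                }
            }
        }
    }
}
```

Note how the forward-only side cannot express "give me just the .txt entries" without visiting every preceding entry, which is exactly the behavioral difference the two APIs make explicit.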


qmfrederik (Collaborator) commented Jun 15, 2017

> I disagree. Archive is random access (like the current ZipArchive is random access). The forward-only requires a different way of thinking. This goes back to using IEnumerable or not. Archive is a "nicer" API. However, for perf concerns, you should learn a specialized forward-only API (Reader). There should be a matching ArchiveFactory that is random access and can be read/write. This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.

Some archive types are designed to be read forward-only (such as CPIO or tar archives). On the other hand, zip archives have a central directory at the end of the file, so you can have a very efficient random access API for zip archives, too.
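To make the zip point concrete: random access starts by locating the End of Central Directory (EOCD) record at the end of the file, which holds the offset of the central directory, from which every entry's header offset is known. A minimal sketch (offsets per the zip specification; Zip64 and multi-disk archives are ignored):

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class ZipSketch
{
    // Returns the central directory offset recorded in the EOCD, or -1 if
    // no EOCD signature is found (not a zip, or truncated).
    public static long FindCentralDirectoryOffset(byte[] zip)
    {
        // The EOCD is 22 bytes plus an optional trailing comment, so scan
        // backwards from the end for its signature, PK\x05\x06.
        for (int i = zip.Length - 22; i >= 0; i--)
        {
            if (zip[i] == 0x50 && zip[i + 1] == 0x4B &&
                zip[i + 2] == 0x05 && zip[i + 3] == 0x06)
            {
                // Bytes 16..19 of the EOCD: offset of the central directory.
                return BitConverter.ToUInt32(zip, i + 16);
            }
        }
        return -1;
    }
}
```

Tar has no such trailer: its entries are a pure `<header><data>` sequence, so the only options are a full forward scan or, if sizes are known, seeking from header to header.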


adamhathcock commented Jun 15, 2017

> Some archive types are designed to be read forward-only (such as CPIO or tar archives). On the other hand, zip archives have a central directory at the end of the file, so you can have a very efficient random access API for zip archives, too.

Trust me, I know. I've been dealing with and creating SharpCompress for almost a decade.

For random access APIs (Archive): if there is no central directory, then you scan the file. If the sizes of the entries are known, then you can just seek to the headers and it's not a big deal.

Zip is the only format that really does have a central directory. 7Zip has funky stuff (and I hate it) and everything else is basically <header><data><header><data> (RAR, Tar, etc.), so they need to be scanned.

My point is that for very large files, in certain cases you want a forward-only API for performance reasons. Everything except 7Zip and Zip64 (when writing) can be done this way. Zip has post-data descriptors when writing out (though Zip64 doesn't support this). Tar can be written out as a forward-only archive if you know the data size for the header.
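The tar case can be sketched as follows: because the entry size goes into the header, knowing it up front lets the header, data, and padding be emitted in one forward pass. Field offsets follow the ustar format; error handling and long-name support are omitted:

```csharp
using System;
using System.IO;
using System.Text;

static class TarSketch
{
    // Writes one ustar header followed by the entry data, padded to
    // 512-byte blocks, without ever seeking on `output`.
    public static void WriteEntry(Stream output, string name, byte[] data)
    {
        var header = new byte[512];
        // ustar numeric fields are NUL-terminated octal ASCII.
        void Octal(long value, int offset, int length) =>
            Encoding.ASCII.GetBytes(Convert.ToString(value, 8).PadLeft(length - 1, '0'))
                    .CopyTo(header, offset);

        Encoding.ASCII.GetBytes(name).CopyTo(header, 0);           // name (100 bytes)
        Octal(Convert.ToInt32("644", 8), 100, 8);                  // mode
        Octal(0, 108, 8);                                          // uid
        Octal(0, 116, 8);                                          // gid
        Octal(data.Length, 124, 12);                               // size: the crucial field
        Octal(DateTimeOffset.UtcNow.ToUnixTimeSeconds(), 136, 12); // mtime
        header[156] = (byte)'0';                                   // typeflag: regular file
        Encoding.ASCII.GetBytes("ustar").CopyTo(header, 257);      // magic
        Encoding.ASCII.GetBytes("00").CopyTo(header, 263);         // version

        // Checksum is computed with the chksum field itself set to spaces.
        for (int i = 148; i < 156; i++) header[i] = (byte)' ';
        int sum = 0;
        foreach (byte b in header) sum += b;
        Octal(sum, 148, 7);                                        // six octal digits...
        header[154] = 0;                                           // ...then NUL, then space

        output.Write(header, 0, header.Length);
        output.Write(data, 0, data.Length);
        int pad = (512 - data.Length % 512) % 512;                 // pad data to a full block
        output.Write(new byte[pad], 0, pad);
    }
}
```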


ianhays (Member) commented Jun 16, 2017

> Then we don't need it! Can already match based on headers.

I was more thinking about having it for the Writer.

> It's not about protecting, it's about guiding. People often say they like the SharpCompress API better than others because it's more explicit. I'm definitely against combining the forward-only and random access scenarios. The use cases need to be handled in very different ways, and forcing the APIs together feels unnatural: knowing which methods/properties can be used, and when, puts a lot of cognitive burden on the user. Keeping them separate doesn't cost performance, and it results in simpler code.

I definitely don't disagree with you. I personally have always disliked having the same class for writing/reading because it's more difficult to reason about. I don't even like the words 'Write'/'Read' being used in DeflateStream/GZipStream, but they're necessary to implement Stream.

I mostly just want to make sure we're avoiding unnecessary duplication and/or APIs with low usage from redundancy.

You've convinced me that we should at least have an IForwardArchive and an IArchive.

If I was designing this from scratch I would probably aim to have something like (very rough):

    public interface IForwardArchive : IEnumerable<IArchiveEntry>
    {
        // Reader functions
        IEnumerator<IArchiveEntry> GetEnumerator();
        IReaderEntry CurrentEntry { get; }

        // Writer functions
        void Write(string filename, Stream source, DateTime? modificationTime);

        // Shared functions
        ArchiveType Type { get; }
        void Dispose();
    }

    public interface IArchive : IForwardArchive
    {
        IEnumerable<IArchiveEntry> Entries { get; }
        IEnumerable<IVolume> Volumes { get; }

        bool IsSolid { get; }
        bool IsComplete { get; }
        long TotalSize { get; }
        long TotalUncompressSize { get; }
    }
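For illustration, consuming that rough sketch might look like the following. This is hypothetical throughout: `IForwardArchive` is the sketch above, and `Key`, `OpenEntryStream`, and `LastModifiedTime` are assumed entry members (borrowed from SharpCompress naming) that the sketch does not actually declare:

```csharp
using System.IO;

static class ForwardArchiveExample
{
    // Hypothetical consumption of the rough IForwardArchive sketch above.
    public static void CopyTextEntries(IForwardArchive source, IForwardArchive destination)
    {
        foreach (IArchiveEntry entry in source)              // forward-only: single pass
        {
            if (!entry.Key.EndsWith(".txt")) continue;       // skipping is cheap
            using (Stream data = entry.OpenEntryStream())    // assumed member, per SharpCompress
            {
                // Write() is the sketch's forward-only writer function.
                destination.Write(entry.Key, data, entry.LastModifiedTime);
            }
        }
    }
}
```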
adamhathcock commented Jun 17, 2017

> I was more thinking about having it for the Writer.

Yes, with a writer you have to specify an archive type.

> I mostly just want to make sure we're avoiding unnecessary duplication and/or APIs with low usage from redundancy.
>
> You've convinced me that we should at least have an IForwardArchive and an IArchive.
>
> If I was designing this from scratch I would probably aim to have something like (very rough):

I don't think that layout quite works. You shouldn't have CurrentEntry on IArchive: it doesn't have the concept of a current entry because it's random access. There is very little shared between IArchive and IForwardArchive, really. The concepts don't match much beyond the fact that you're dealing with entries; one case is only an enumerable, the other is a collection.

You also want to remove Write from the forward-only reading functions. For a forward-only writer, you want to explicitly create an entry each time and pass a byte stream to the writer, so it can handle a potentially large stream or network stream one entry at a time. Again, mixing writing and reading in a forward-only context just creates confusion in the API.

> I don't even like the words 'Write'/'Read' being used in DeflateStream/GZipStream, but they're necessary to implement Stream.

I don't like Read/Write being on Stream at the same time either. One thing the Java API got correct was separating them :) Having to always check CanWrite or CanRead can be burdensome.
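A sketch of the separate forward-only writer being argued for here, modeled on SharpCompress's IWriter (which already has roughly this shape). `IForwardWriter` and the `ArchiveType` enum below are illustrative names, not a proposal:

```csharp
using System;
using System.IO;

// Illustrative only; mirrors SharpCompress's IWriter shape.
public enum ArchiveType { Zip, Tar, GZip, SevenZip, Rar }

public interface IForwardWriter : IDisposable
{
    ArchiveType Type { get; }

    // One entry per call: the writer drains `source` immediately and emits
    // the header and compressed data before returning, so nothing is
    // buffered and large or network-backed sources stream straight through.
    void Write(string entryPath, Stream source, DateTime? modificationTime);
}
```

The key property is that there is no reading surface at all on this type, so a caller can never be confused about which half of a mixed read/write interface is valid for the stream at hand.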


am11 (Contributor) commented Jul 8, 2017

Should this proposal be considered in conjunction with the upcoming System.IO.Pipeline: https://github.com/dotnet/corefxlab/blob/091d12e/docs/specs/pipelines-io.md?

There seems to be compression APIs in place based on pipelines as well, under System.IO.Pipelines.Compression namespace: https://github.com/dotnet/corefxlab/tree/091d12e/src/System.IO.Pipelines.Compression


karelz (Member) commented Jul 9, 2017

It is currently unclear if System.IO.Pipeline will make it into CoreFX or not. There are still active discussions around that. It may just sit on top of CoreFX.

If there are Compression investments in the Pipelines space, I would hope they are considered to be done in lower layers (System.IO.Compression) first, so that more people/apps can benefit from the investments. @joshfree @KrzysztofCwalina can you please comment?


joshfree (Member) commented Jul 10, 2017

> If there are Compression investments in the Pipelines space, I would hope they are considered to be done in lower layers (System.IO.Compression) first, so that more people/apps can benefit from the investments.

Yes, any actual compression investments are being done at a lower level. For instance one of our summer interns is working on System.IO.Compression.Brotli for a future release in .NET Core @ https://github.com/dotnet/corefxlab/tree/master/src/System.IO.Compression.Brotli.
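For context, that corefxlab work later shipped as `System.IO.Compression.BrotliStream` (.NET Core 2.1 and later), with the same stream shape as DeflateStream/GZipStream. A minimal round trip against that shipped API:

```csharp
using System.IO;
using System.IO.Compression;

static class BrotliSketch
{
    // Compresses a buffer with BrotliStream; leaveOpen keeps the backing
    // MemoryStream readable after the compressor flushes on dispose.
    public static byte[] Compress(byte[] input)
    {
        using (var output = new MemoryStream())
        {
            using (var brotli = new BrotliStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                brotli.Write(input, 0, input.Length);
            }
            return output.ToArray();
        }
    }

    // Decompresses by reading through the BrotliStream wrapper.
    public static byte[] Decompress(byte[] compressed)
    {
        using (var output = new MemoryStream())
        using (var brotli = new BrotliStream(new MemoryStream(compressed), CompressionMode.Decompress))
        {
            brotli.CopyTo(output);
            return output.ToArray();
        }
    }
}
```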


twsouthwick (Member) commented Nov 16, 2017

Users of DocumentFormat.OpenXml (a common Office document manipulation library) have hit a few issues in System.IO.Packaging that require streaming support in zip files. I'm not familiar enough with the domain to add much to the API discussion, but I would love to see some movement on this issue.


DavidThi808 commented Nov 16, 2017

I want to add my voice agreeing with @twsouthwick - this is critically needed for the OpenXML library.

thanks - dave

