Add a Forward-only API for System.IO.Compression #1550
@ianhays what is the next step? You did +1 above; does that mean you like the idea? Do we need an API proposal? Or is it ready for API review? (Disclaimer: I just skimmed through it) |
I can put together a new API based on what I've done. It would be very similar to SharpCompress. |
Then maybe the API should return a type that can be used in a |
I've gone back-and-forth about this myself. I opened https://github.com/dotnet/corefx/issues/9709 to track my thoughts long-term on a way to centralize our compression algorithms somehow. It's clear that as we continue adding more compression types to corefx it's going to become trickier and trickier to integrate them smoothly and expose them to devs of all skill levels without being overwhelming or overly simplistic.
I think in this case increased usability is more valuable than protecting devs from themselves. You could implement a code analysis rule specifically to look for instances where devs do a |
https://github.com/dotnet/corefx/issues/9237 is also somewhat related, though less so. |
I do like the idea. My main concern is that I really want to do our compression expansion right. There are a ton of questions around design and usability that I want answered well before we take anything like this issue, dotnet/corefx#9237, or dotnet/corefx#9709. Adding something like LZMA is easy because we can model it after DeflateStream/GZipStream with a similar API, but this issue (and the others linked above) is far more complex in the long-term implications of its design and API expansion. |
OK, what is the next step? (you coming back with larger design?) What is rough ETA? |
Continue discussing design choices in this issue, dotnet/corefx#9709, and dotnet/corefx#9237. Wait for other people to be interested and comment/vote on these threads with their opinion.
Whenever we reach quorum and have a sizable mass of users requesting the feature. It's not something that we're hurting for right now or that a lot of customers are asking for, so I don't feel comfortable assigning an ETA for it. |
Here's a rough Reader API based on what I've done with SharpCompress, making the Reader itself enumerable.

//The file format of the Archive. Can be confusing as GZip is a single-file format as well as a compression type (using Tar as the archive type).
public enum ArchiveType
{
Rar,
Zip,
Tar,
SevenZip,
GZip
}
//The compression type used on the Archive/Entry. GZip,BZip2,LZip,Xz are single file formats as well as a compression type.
public enum CompressionType
{
None,
GZip,
BZip2,
PPMd,
Deflate,
Rar,
LZMA,
BCJ,
BCJ2,
LZip,
Xz,
Unknown
}
//possibly not needed
public class ReaderOptions
{
//Password protected archives (out of scope?)
public string Password { get; set; }
//Take ownership of the stream
public bool CloseStream { get; set; }
}
public static class ReaderFactory
{
public static IReader Open(Stream stream, ReaderOptions options = null);
#if NETSTANDARD1_3
public static IReader Open(FileInfo file, ReaderOptions options = null);
public static IReader Open(string filePath, ReaderOptions options = null);
#endif
}
public interface IReader : IEnumerable<IReaderEntry>
{
IEnumerator<IReaderEntry> GetEnumerator();
ArchiveType ArchiveType { get; }
IReaderEntry CurrentEntry { get; }
void Dispose();
}
public interface IReaderEntry
{
Stream OpenRead();
}
public interface IEntry
{
//This is on entry because some archive types (like ZIP) can have compression types different per entry
CompressionType CompressionType { get; }
string Key { get; }
//everything below here is optional depending on archive type
DateTime? ArchivedTime { get; }
long? CompressedSize { get; }
long? Crc { get; }
long? Size { get; }
DateTime? CreatedTime { get; }
bool? IsDirectory { get; }
bool? IsEncrypted { get; }
bool? IsSplit { get; }
DateTime? LastAccessedTime { get; }
DateTime? LastModifiedTime { get; }
}
public static class IReaderExtensions
{
#if NETSTANDARD1_3
public static void WriteEntryTo(this IReaderEntry reader, string filePath);
/// <summary>
/// Extract all remaining unread entries to specific directory, retaining filename
/// </summary>
public static void WriteAllToDirectory(this IReader reader, string destinationDirectory,
ExtractionOptions options = null);
/// <summary>
/// Extract to specific directory, retaining filename
/// </summary>
public static void WriteEntryToDirectory(this IReaderEntry reader, string destinationDirectory,
ExtractionOptions options = null);
/// <summary>
/// Extract to specific file
/// </summary>
public static void WriteEntryToFile(this IReaderEntry reader, string destinationFileName,
ExtractionOptions options = null);
#endif
}
[Flags]
public enum ExtractionOptions
{
None,
/// <summary>
/// overwrite target if it exists
/// </summary>
Overwrite,
/// <summary>
/// extract with internal directory structure
/// </summary>
ExtractFullPath,
/// <summary>
/// preserve file time
/// </summary>
PreserveFileTime,
/// <summary>
/// preserve windows file attributes
/// </summary>
PreserveAttributes,
}

Sample usage:

public static void Main()
{
using(var reader = ReaderFactory.Open("C:\\test.zip"))
{
foreach(var entry in reader)
{
if (!entry.IsDirectory)
{
entry.WriteEntryToDirectory("D:\\temp", ExtractionOptions.ExtractFullPath);
}
}
}
using(var reader = ReaderFactory.Open("C:\\test.zip"))
{
reader.WriteAllToDirectory("D:\\temp2", ExtractionOptions.ExtractFullPath);
}
//no explicit using statement for IReader
foreach(var entry in ReaderFactory.Open("C:\\test.zip"))
{
if (entry.Key.StartsWith("x"))
{
using(var data = entry.OpenRead())
{
data.CopyTo(new MemoryStream());
}
}
}
}

Can expand out to show the generic Archive API too. |
Would it then make sense to design (and implement) an initial version of the whole compression expansion in corefxlab? That way, all the related APIs can be designed together and easily inform each other. |
Putting it wholesale together as a real thing in corefxlab makes sense. Evolving the API might be too slow. |
That is probably the best idea. We can design the large structure in corefxlab, then move small pieces to corefx one-by-one with their own justification so that we can actually get them through API review.
I like that too. Your rough API is pretty similar to what I had in mind. A couple suggestions:
On a more general note, could you explain why you split reading and writing to their own interfaces/classes? Having a single |
It should be possible to combine this with the concept of a compression algorithm interface a la https://github.com/dotnet/corefx/issues/9709. |
Makes sense to explicitly try an ArchiveType.
That's a way to go. Explicitly identify Tar archives with a matching single file compression.
Can do.
IEntry is used with IArchive. Just want to reuse the object among both the forward-only API and random access API.
Only one of the extensions is for IReader. The others are for IReaderEntry. I guess it's not explicit as they're all in the same extension class.
I disagree. Archive is random access (like the current ZipArchive is random access). The forward-only requires a different way of thinking. This goes back to using IEnumerable or not. Archive is a "nicer" API. However, for perf concerns, you should learn a specialized forward-only API (Reader). There should be a matching ArchiveFactory that is random access and can be read/write. This is what exists in SharpCompress. If you try to combine random access and forward-only perf in the same API, you're in for a world of hurt.
I don't think so. You could probably put DeflateStream, LZMAStream, etc behind CompressionStream. IEntry is about reading the metadata of an entry in an archive. Then choosing to read/extract the data or skip it. |
Yeah. It's not ideal though.
I see, thanks.
To me it seems like a lot of distinct interfaces and classes to protect the programmer from themselves. I'd personally much rather use one interface/class like how ZipArchive can either read or write depending on the stream passed to it. |
Then we don't need it! Can already match based on headers.
It's not about protecting, it's about guiding. People often say they like the SharpCompress API better than others because it's more explicit. I'm definitely against combining the forward-only and random access scenarios. The use-cases have very different ways they need to be handled, and forcing the APIs together feels unnatural. Knowing which methods/properties can be used and when puts a lot of cognitive burden on the user. Keeping them separate doesn't reduce performance, and it results in simpler code.
The Archive API does do read/writes in SharpCompress. The Reader/Writer API has a clean break as these actions in a forward-only scenario are very different. I'm just offloading my experience with SharpCompress. I'm not saying there isn't a better way than the direction I've gone but I don't want the same mistakes repeated. |
Some archive types are designed to be read forward-only (such as CPIO or tar archives). On the other hand, zip archives have a central directory at the end of the file, so you can have a very efficient random access API for zip archives, too. |
Trust me, I know. I've been dealing with and creating SharpCompress for almost a decade. For random access APIs (Archive): if there is no central directory, then you scan the file. If the sizes of the entries are known, then you can just seek to headers and it's not a big deal. Zip is the only format that really does have a central directory. 7Zip has funky stuff (and I hate it) and everything else is basically forward-only.

My point is that for very large files in certain cases you want a forward-only API for performance reasons. Everything except 7Zip and Zip64 (when writing) can be done this way. Zip has post data descriptors when writing out (though Zip64 doesn't support this). Tar can be written out as a forward-only archive if you know the data size for the header. |
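For reference, this forward-only read pattern as it exists in SharpCompress today looks roughly like the following (names as of recent SharpCompress releases; a sketch for illustration, not the proposed corefx surface):

```csharp
using System.IO;
using SharpCompress.Common;
using SharpCompress.Readers;

class ForwardOnlyReadSample
{
    static void Main()
    {
        // Forward-only read: works on non-seekable streams because the
        // reader never seeks; it walks entry headers as they arrive.
        using (Stream stream = File.OpenRead("archive.tar.gz"))
        using (IReader reader = ReaderFactory.Open(stream))
        {
            while (reader.MoveToNextEntry())
            {
                if (!reader.Entry.IsDirectory)
                {
                    // Entries we don't extract are skipped without
                    // decompression where the format allows it.
                    reader.WriteEntryToDirectory("out", new ExtractionOptions
                    {
                        ExtractFullPath = true,
                        Overwrite = true
                    });
                }
            }
        }
    }
}
```

Note the single-pass shape: there is no `Entries` collection, only a cursor that moves forward.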
I was more thinking about having it for the Writer.
I definitely don't disagree with you. I personally have always disliked having the same class for writing/reading because it's more difficult to reason about. I don't even like the words 'Write'/'Read' being used in DeflateStream/GZipStream, but they're necessary to implement Stream. I mostly just want to make sure we're avoiding unnecessary duplication and/or APIs with low usage from redundancy. You've convinced me that we should at least have an IForwardArchive and an IArchive. If I was designing this from scratch I would probably aim to have something like (very rough):

public interface IForwardArchive : IEnumerable<IArchiveEntry>
{
// Reader functions
IEnumerator<IArchiveEntry> GetEnumerator();
IReaderEntry CurrentEntry { get; }
// Writer functions
void Write(string filename, Stream source, DateTime? modificationTime);
// Shared functions
ArchiveType Type { get; }
void Dispose();
}
public interface IArchive : IForwardArchive
{
IEnumerable<IArchiveEntry> Entries { get; }
IEnumerable<IVolume> Volumes { get; }
bool IsSolid { get; }
bool IsComplete { get; }
long TotalSize { get; }
long TotalUncompressSize { get; }
} |
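A hypothetical usage sketch of the rough IForwardArchive above; ForwardArchiveFactory and the entry members shown are assumptions for illustration, not part of any proposal:

```csharp
// Hypothetical usage only: ForwardArchiveFactory is an assumed factory name,
// and entry.Key is borrowed from the IEntry sketch earlier in the thread.
using (IForwardArchive archive = ForwardArchiveFactory.Open(inputStream))
{
    // Forward-only enumeration: each entry is visited once, in stream order,
    // so the same interface shape works for seekable and non-seekable input.
    foreach (IArchiveEntry entry in archive)
    {
        Console.WriteLine(entry.Key);
    }
}
```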
Yes, with a writer you have to specify an archive type.
I don't think that layout quite works. You shouldn't have … You also want to remove … |
I don't like Read/Write being on Stream at the same time either. One thing the Java API got correct was separating it :) Having to always check |
Should this proposal be considered in conjunction with the upcoming System.IO.Pipeline: https://github.com/dotnet/corefxlab/blob/091d12e/docs/specs/pipelines-io.md? There seems to be compression APIs in place based on pipelines as well, under System.IO.Pipelines.Compression namespace: https://github.com/dotnet/corefxlab/tree/091d12e/src/System.IO.Pipelines.Compression |
It is currently unclear if … If there are Compression investments in the Pipelines space, I would hope they are considered to be done in lower layers ( |
Yes, any actual compression investments are being done at a lower level. For instance one of our summer interns is working on System.IO.Compression.Brotli for a future release in .NET Core @ https://github.com/dotnet/corefxlab/tree/master/src/System.IO.Compression.Brotli. |
We have encountered a few issues in System.IO.Packaging that users of DocumentFormat.OpenXml (a common Office document manipulation library) have hit that require streaming support in zip files. I'm not familiar with the domain enough to add much to the API discussion, but would love to see some movement on this issue. |
I want to add my voice agreeing with @twsouthwick - this is critically needed for the OpenXML library. thanks - dave |
In the case of |
Moved to Future. This is not a must-fix for 3.0.
Triage: |
Agreed with the API similar to Tar! The problem with Zip is that the format is streamable, but not all Zips are created with the required header info to make them stream-readable. I just throw exceptions in SharpCompress in this case. The current implementation of Zip requires using the central directory at the end. GZipStream is also a problem because the current implementation doesn't expose the file headers/metadata to use it properly as an archive format. (Related to dotnet/corefx#7570 ) |
I have a library that parses XLSX files which would have made good use of this. |
I like the general idea, despite the problems with the ZIP format. In most environments, one might be able to constrain oneself to compliant ZIPs, and it's very useful for TAR, too. However, some of the methods might need Async variants, I guess, especially |
Wanted to add a bit more information on this topic - I am one of the authors of Microsoft SimpleRemote that we use for testing in Surface, and we ended up using a 3rd party tar library because there was no way to do forward-only operations with what's included in the current Compression namespace. This is less of an API defect and more of a problem with the zip format itself - Zip is easy to stream when writing to it, because it stores metadata for the files at the end of the archive, so you can easily write to the stream without ever seeking. Reading, however, is incredibly difficult, because you need to read the end of the file before operating on the archive. To address this, if you try to open a non-seekable stream with ZipArchive, .NET internally reads the entire stream into a memory stream before processing (this works in some cases, but does not work on zip files larger than 4GB, as MemoryStream does not support streams larger than Int32 bytes).

Here's an example of where this is a problem: let's say you're reading a zip stream over a network - this means you'd need to download the entire file before being able to access any of the contents. If we compare this to tar (or cpio, or any tar-like format), we don't have this restriction - we can freely read elements as they come in, since all metadata for a given file is just before that file. This is a problem for any case where the source stream does not support seeking (if we're encrypting data using CryptoStreams, for example, they also aren't seekable). Adding tar, or another archive that supports forward-only reads and writes, would make these scenarios much more straightforward and performant.

One note, though - if we implement tar, please use something that, at minimum, supports USTAR (and, ideally, supports the 2001 PAX extensions, so we can write files with unlimited size - historic tar has an 8GB/entry cap). |
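To make the buffering behavior described above concrete, here is a small sketch of the failure mode (the non-seekable `network` stream is a stand-in for any network or crypto stream):

```csharp
using System.IO;
using System.IO.Compression;

class NonSeekableZipSample
{
    // Assume network.CanSeek == false (e.g. a NetworkStream or CryptoStream).
    static void Read(Stream network)
    {
        // ZipArchive must locate the central directory at the END of the data,
        // so for a non-seekable input it first copies the entire stream into
        // memory; nothing below runs until the whole payload is buffered.
        using (var zip = new ZipArchive(network, ZipArchiveMode.Read))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                // By this point the full archive is already in memory,
                // which is why very large archives fail or thrash.
            }
        }
    }
}
```

A tar-like forward-only reader avoids this entirely, since each entry's metadata precedes its data in the stream.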
SharpCompress does what you need but the Tar support isn’t super great. Would love help expanding it. |
@adamhathcock I'll take a look! From a quick glance at the code, it looks like SharpCompress Tar entries are USTAR + GNU extensions, correct? It should be pretty easy to add basic support for PAX extensions for long-file, long-link, and large sizes. |
Is there an update on this issue? This issue affects my use of the OpenXML SDK. |
This was started in https://github.com/dotnet/corefx/issues/9657
There seems to be a growing desire/need for a forward-only API that accesses compressed file formats (e.g. zip, gzip, tar, etc.) in a streaming manner. This means very large files as well as streams like network streams can be read and decompressed on the fly. Basically, the API reads from Stream objects and never seeks on it. This is how the Reader/Writer API from SharpCompress works.
Here's a sample from the unit tests:
WriteEntryToDirectory is an extension method that provides some shortcuts for dealing with file operations, but what it really does is just grab the internal stream and decompress. The actual entry method is just IReader.WriteEntryTo(Stream).

If the entry isn't decompressed, then the internal stream is just moved forward and not decompressed if possible (some formats require decompression since the compressed length can be unknown). The Writer API works similarly.
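For comparison, the SharpCompress Writer side looks roughly like this (names as of recent SharpCompress releases; a sketch for illustration, not the proposed corefx API):

```csharp
using System.IO;
using SharpCompress.Common;
using SharpCompress.Writers;

class ForwardOnlyWriteSample
{
    static void Main()
    {
        // Forward-only write: each entry is compressed and emitted to the
        // output stream as it is added; the writer never seeks backwards.
        using (Stream output = File.Create("out.tar.gz"))
        using (IWriter writer = WriterFactory.Open(output, ArchiveType.Tar,
                   new WriterOptions(CompressionType.GZip)))
        using (Stream source = File.OpenRead("readme.txt"))
        {
            writer.Write("docs/readme.txt", source);
        }
    }
}
```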
There is also a generic API from ReaderFactory or WriterFactory that doesn't require knowledge of the format beforehand. There is also a similar ArchiveFactory that is writeable (it uses WriterFactory internally to save) that could also be used for the current ZipArchive API and beyond.

As the author of SharpCompress, I'd like to push a lot of these ideas into the core library, but having native access to the compression streams (like the internal zlib) would be a great performance benefit for me. I haven't ever written any compression algorithm implementations myself, so I'm sure my managed implementations need a lot of love.
I would start by creating ZipReader and ZipWriter (as well as starting the generic API) using a lot of the internal code already in the core library to prove out the API. This kind of relates to https://github.com/dotnet/corefx/issues/14853, but forward-only access is something most libraries don't account for, so I'm not sure. Other easy additions would be Reader/Writer support for GZip, BZip2 and LZip (with an LZMA compressor).

Tar support linked with the previous single-file formats would be a larger addition. SharpCompress deals with tar.gz, tar.bz2 and tar.lz auto-magically by first detecting the compressed format and then a tar file inside. The above API works the same.

Thoughts?
Summoning a few people from the other issue
cc: @ianhays @karelz @qmfrederik