Add support for 7z/LZMA1/LZMA2 to System.IO.Compression #1542
LZMA is pretty much the state-of-the-art general-purpose compression algorithm. It pushes the Pareto frontier outwards across quite a large range. If there is interest in adding more algorithms, there would also be room for a super-fast one that compresses in the 10-300 MB/s range. Such algorithms exist and can be found using any search engine.
I've already ported the C# impl for internal use in CLI: https://github.com/dotnet/cli/tree/rel/1.0.0/src/Microsoft.DotNet.Archive/LZMA

The work to do to pull this in is the following:
For perf, we could do a similar thing to what was done for ZLIB, where we just pick up the native impl and add that to CLRCompression. @GSPP if you have requests for other algos, please file a separate issue.
Something we should consider before we start working on adding support for Tar/Deflate/7z/LZMA/LZMA2/LZ4/BZ2/etc. is how we want to design the APIs for the classes to keep them relatively similar without cluttering up Sys.IO.Compression. Options there, off the top of my head:
I imagine we'll want to go with the first option so we can provide fine-grained control of the algorithm variables, but it's worth considering nonetheless.
First option is what I was thinking. I think we can also look at higher-level, simple-to-use archive APIs that allow selecting the compression algorithm for an archive without the caller needing to know the specific algo type or params, but we'd also want to support fine-grained composition of these things to allow for maximum flexibility/tuning. Also, the reading APIs should be able to "sniff" a data stream to determine what it is (if the format allows for that deterministically) without forcing the user to tell us via type instantiation.
I think this would be the happiest compromise between full functionality and ease of use. I'm picturing a bunch of algorithm-specific streams (DeflateStream/GZipStream/LZMAStream) as well as one high-level stream (CompressionStream) that has simple write/read functionality using the defaults of the chosen compression type, e.g. something like the sketch below.
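A minimal sketch of that shape; all type names are hypothetical (none of this is an approved API), and the high-level entry point is written as a factory method rather than a full Stream subclass just to keep the sketch short:

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical enum naming the supported algorithms.
public enum CompressionType { Deflate, GZip, Lzma }

public static class CompressionStreamFactory
{
    // Returns an algorithm-specific stream configured with that algorithm's defaults.
    // An LZMA stream does not exist in System.IO.Compression today; it is the type
    // this issue proposes, so it is left as a placeholder here.
    public static Stream Create(Stream stream, CompressionType type, CompressionMode mode) =>
        type switch
        {
            CompressionType.Deflate => new DeflateStream(stream, mode),
            CompressionType.GZip => new GZipStream(stream, mode),
            CompressionType.Lzma => throw new NotImplementedException("LZMAStream placeholder"),
            _ => throw new ArgumentOutOfRangeException(nameof(type))
        };
}

// Usage: compress with sensible defaults, without knowing algorithm specifics.
// using Stream s = CompressionStreamFactory.Create(output, CompressionType.GZip, CompressionMode.Compress);
```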
That would be ideal. The algorithm-specific APIs (e.g. DeflateStream) should require that the given stream adhere to the format, but the higher-level API should attempt to detect the header type and default (probably to Deflate) if it can't be determined.
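A sketch of what that sniffing could look like, assuming a seekable stream so the header bytes can be rewound; the magic numbers are the published signatures for gzip, 7z, and xz, and the format list is illustrative only:

```csharp
using System;
using System.IO;

public static class FormatSniffer
{
    public static string Detect(Stream stream)
    {
        long start = stream.Position;
        Span<byte> header = stackalloc byte[6];
        int read = stream.Read(header);
        stream.Position = start; // rewind so the real decoder sees the full data

        if (read >= 2 && header[0] == 0x1F && header[1] == 0x8B)
            return "gzip";
        if (read >= 6 && header[0] == 0x37 && header[1] == 0x7A && header[2] == 0xBC &&
            header[3] == 0xAF && header[4] == 0x27 && header[5] == 0x1C)
            return "7z";
        if (read >= 6 && header[0] == 0xFD && header[1] == 0x37 && header[2] == 0x7A &&
            header[3] == 0x58 && header[4] == 0x5A && header[5] == 0x00)
            return "xz";

        // Raw deflate (and headerless formats like .lzma) have no reliable
        // signature, hence the "default to Deflate" fallback discussed above.
        return "deflate";
    }
}
```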
Seems reasonable so long as it doesn't hurt perf (e.g. cause an additional buffer copy). I'd say that's a separate feature, but definitely one worth looking at.
Agreed. I opened https://github.com/dotnet/corefx/issues/9709 to track the discussion on that.
I think the enum approach suffers from non-extensibility. Wouldn't it be better to use one derived type for each algorithm? Algorithms have different parameters and sometimes different capabilities. For example, Zip has the capability to add named entries (files). Many compression libraries offer a ZipStream that behaves like a normal stream plus the ability to define files.

Proposal: make each algorithm its own stream type with its own parameters (see the sketch below).

Also, I do not see in what way a class-based approach (as opposed to an enum-based approach) would be inferior(?). It seems equal or better in every regard (apart from format detection).

I advise against format detection when decompressing. The format should be statically known, and usually is. Some formats are headerless, which breaks the scheme.
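For illustration, a class-per-algorithm shape might carry algorithm-specific knobs like this; every name here is hypothetical, with the LZMA properties mirroring the lc/lp/pb/dictionary-size parameters from the LZMA SDK:

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical options bag: LZMA exposes tuning knobs that Deflate simply
// does not have, which is the argument for one derived type per algorithm.
public sealed class LzmaCompressionOptions
{
    public int DictionarySize { get; set; } = 1 << 24; // 16 MB
    public int LiteralContextBits { get; set; } = 3;   // lc
    public int LiteralPositionBits { get; set; } = 0;  // lp
    public int PositionBits { get; set; } = 2;         // pb
}

// The corresponding stream would take the options in its constructor,
// analogous to DeflateStream's shape:
//   public LzmaStream(Stream stream, CompressionMode mode, LzmaCompressionOptions options)
```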
Good point. I don't see how …
I don't think we disagree. @ianhays was talking about some convenience construction pattern that would use the public types.

Let's use https://github.com/dotnet/corefx/issues/9709 to track coming up with a convenience construction pattern, and use this issue for describing the new streams we intend to add.
Back to the LZMA discussion: not sure how we're going to want to do this, since the LZMA SDK includes implementations in both C and C#. Some options and related thoughts:
Ideally, it would be a native code solution where possible, with a managed fallback. How much slower is the C# version? My guess would be 2-3x; C compilers excel at this kind of code.
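For the native path, a minimal P/Invoke sketch. This assumes the LZMA SDK's LzmaLib were compiled into a native library (called "clrcompression" here purely for illustration; its inclusion is an assumption, not current fact), with the signatures below mirroring LzmaCompress/LzmaUncompress from the SDK's LzmaLib.h:

```csharp
using System;
using System.Runtime.InteropServices;

internal static class LzmaNative
{
    // One-shot buffer-to-buffer compression, per LzmaLib.h; size_t maps to UIntPtr.
    [DllImport("clrcompression")]
    internal static extern int LzmaCompress(
        byte[] dest, ref UIntPtr destLen,
        byte[] src, UIntPtr srcLen,
        byte[] outProps, ref UIntPtr outPropsSize,
        int level, uint dictSize, int lc, int lp, int pb, int fb, int numThreads);

    // One-shot decompression; props is the 5-byte properties block LzmaCompress produced.
    [DllImport("clrcompression")]
    internal static extern int LzmaUncompress(
        byte[] dest, ref UIntPtr destLen,
        byte[] src, ref UIntPtr srcLen,
        byte[] props, UIntPtr propsSize);
}
```

A managed fallback would then be selected at runtime on platforms where the native library fails to load.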
Alright, I've got native and managed versions working in Sys.IO.Compression on Windows, so we can get some early perf results. Properties are set to the defaults for both.
Note that the native LZMA here is doing file IO itself, so it is going to be a little faster than the end product would be, since we'd want to do the IO in C# so LZMA could follow the Stream pattern.
I wonder why native and managed have such different results.
I expect the implementations are quite different. I also realize now that the above results used the property values that Eric used in his CLI port. I re-ran the tests using the default values and edited the post with those results. It looks like the main difference is that the managed results for small files are much faster now.
Yeah, those parameters were tuned for maximum compression of the CLI payload. I believe the C and C++ implementations are multi-threaded, so that makes a big difference. We could do the work to make the C# implementation multi-threaded and measure the difference.
The results do seem to suffer from differing implementations, and that does seem to make the "...the lzma native implementation is significantly faster." conclusion an unfair one. Unless I have completely lost the ability to read a table, there are a number of results where the managed implementation is much faster than the native one; xargs.x1 is an example where the ratio is almost as big as in the kennedy.xls case, but in the other direction. It just impacts the overall numbers less because it's a smaller input file. It seems that as long as you are only doing decompression, more often than not you are actually better off using the managed implementation.
Agreed. The earlier set of results that led to that conclusion were made using Eric's parameters, not the default parameters, so we weren't comparing apples to apples. I'll strike through my original comment for clarity for people just coming into this thread.
That would be interesting. I'll look into it. I expect it's a fairly large amount of effort, but for large files it would likely be worthwhile.
The multi-threaded operation we care about for perf is match finding; roughly 30% of the time is spent there. The native LZMA parallelizes these operations when multi-threading is enabled (which it is by default). As for getting C# to do the same, it would require a not-unreasonable amount of effort. The C# match finder is essentially a rewrite of the C non-MT match finder, so it follows that a C# multi-threaded match finder could succeed as a rewrite of the C multi-threaded match finder, at least as a baseline.
Compression should be single-threaded by default; the threading architecture is not for a compression stream to decide. It's a useful facility to be able to turn on, though. There seem to be two multithreading techniques in LZMA(2):

1. The match finder, which apparently does not support a high DOP. In the UI, I think the DOP maxes out at 2; you can see this from the way memory usage increases when you increase the DOP.
2. Parallel compression of chunks.

The latter is infinitely scalable but should have an impact on ratio (see the sketch below).
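A rough sketch of the second technique, using DeflateStream as a stand-in since no LZMA stream exists in the BCL. Each chunk starts from fresh compressor state, which is exactly why ratio suffers: matches can never cross a chunk boundary.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class ParallelChunkCompressor
{
    public static byte[][] Compress(byte[] input, int chunkSize = 1 << 20)
    {
        int chunkCount = (input.Length + chunkSize - 1) / chunkSize;
        var compressed = new byte[chunkCount][];

        // Chunks are independent, so throughput scales with core count.
        Parallel.For(0, chunkCount, i =>
        {
            int offset = i * chunkSize;
            int length = Math.Min(chunkSize, input.Length - offset);

            using var buffer = new MemoryStream();
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(input, offset, length);
            }
            compressed[i] = buffer.ToArray();
        });

        // A real container format would need to frame the chunk lengths.
        return compressed;
    }
}
```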
That may be so, but the native LZMA is deciding: it is using 2 threads for match finding, at least on my machine in my tests. It would be useful to test it single-threaded to compare to C# on more even ground; I'll try to find some time to do that.
According to my tests, multi-threading the match finder would give us the biggest bang for our buck in bridging the performance gap in the slower cases, so I'd prioritize that first.
@ianhays Hi, I was just wondering about the state of this issue. Are there any pre-release/alpha/beta implementations that could be tested now?
Not quite. I had to switch off to work on some other things, but I'm hoping to get back to compression for 1.1 or 1.2. You're more than welcome to take a stab at it if you'd like :)
Probably the best way to tackle this would be to add the C# code Eric linked for LZMA to System.IO.Compression and modify the API to be stream-based and similar to DeflateStream. The default values should also be changed from what is in Eric's code. That would be the fastest and simplest way to get LZMA into System.IO.Compression, and the initial benchmarks show it being comparable in performance to the more complicated (and harder to deploy to other platforms) native implementation. If anyone wants to play around with the perf, I've got the code for both managed and native implementations ported already in my fork. An official API suggestion would also be helpful to get things moving; a strawman shape is sketched below.
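As a starting point, a strawman API mirroring DeflateStream's surface, written in the usual ref-assembly style of API proposals (names and defaults are placeholders, not a reviewed design):

```csharp
namespace System.IO.Compression
{
    public partial class LZMAStream : Stream
    {
        public LZMAStream(Stream stream, CompressionLevel compressionLevel);
        public LZMAStream(Stream stream, CompressionLevel compressionLevel, bool leaveOpen);
        public LZMAStream(Stream stream, CompressionMode mode);
        public LZMAStream(Stream stream, CompressionMode mode, bool leaveOpen);
        public Stream BaseStream { get; }
        // Read/Write/Flush overrides omitted; same shape as DeflateStream.
    }
}
```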
@ianhays are you working on it? (it's assigned to you) Or is it up for grabs?
LZ4 would be nice too; there's already a NuGet package in pure C# with friendly licensing: https://github.com/MiloszKrajewski/K4os.Compression.LZ4. ZStandard is also interesting: https://github.com/bp74/Zstandard.Net, but it is not in pure C# yet.
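For reference, the K4os package's simplest entry point looks roughly like this (API as described in its README; worth verifying against the current package version):

```csharp
using K4os.Compression.LZ4;

byte[] input = System.Text.Encoding.UTF8.GetBytes("hello hello hello hello");
byte[] packed = LZ4Pickler.Pickle(input);      // self-contained LZ4 blob
byte[] restored = LZ4Pickler.Unpickle(packed); // round-trips the original bytes
```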
I saw that Windows 11 is going to support RAR / 7zip / etc.
Really? That would be a surprise if W11 added native support for those formats. I think RAR is a commercial, non-open-source format. I guess 7zip is open source, but it will still be a surprise if MS implements it natively. What happens when Igor Pavlov makes changes to 7zip? I don't think MS will go down that rabbit hole. Just thinking out loud.
https://www.neowin.net/news/opening-rar-files-natively-in-windows-11-is-coming-and-people-online-are-going-crazy-over-it/
The Windows 11 update which supports RAR and 7zip has been released. How/when can we use that in .NET?
@Symbai, #92763 (comment) may be relevant here.
Seeing as .NET 8 has just been released and planning for the next version is taking place, is there any chance we could see this prioritised for .NET 9?
I badly need this on .NET 8, because Brotli at its smallest-size setting is too slow and compresses much worse than 7z with LZMA2 on Ultra.
@Mrgaton Agreed, there are no good LZMA/LZMA2 libraries, and I've had several instances where it would have been very nice to have one.
Hello, I'm guessing the window has passed for this to have a chance at being included in .NET 9 later this year?
Yeah, there's zero chance of that happening, unfortunately. A better question would be: how much customer engagement needs to happen before this can be prioritised?

I'm not trying to be demanding or arrogant, and I understand each release only has so much capacity and there are other features the .NET team thinks are more worthwhile. But this issue is almost 10 years old now, and it's the 4th most thumbed-up issue on the entire repository. We're told that thumbing up and engaging with issues on GitHub is the best way to get issues/features looked at. If that statement is true, shouldn't this issue have been picked up a long time ago?

Can someone from the .NET team help me understand what specifically needs to happen to get this prioritised in the next available release? Is it more thumbs up? Is it more reports of customer use cases being blocked because LZMA isn't available?
bro it's a pain to use, I can't find documentation anywhere
Completely agree; we should add all the newer compression algorithms and make them easy to use like the existing Brotli and Deflate, but with the possibility of completely configuring the options.
http://7-zip.org/sdk.html