
Add support for 7z/LZMA1/LZMA2 to System.IO.Compression #1542

Open
Tracked by #62658
joshfree opened this issue Jun 24, 2016 · 52 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression
Milestone

Comments

@joshfree
Member

http://7-zip.org/sdk.html

@GSPP

GSPP commented Jun 24, 2016

LZMA is pretty much the state-of-the-art general-purpose compression algorithm. It pushes the Pareto frontier outward over quite a large range.

If there is interest in adding more algorithms, there would be room for a super-fast one that compresses in the range of 10-300 MB/s. Such algorithms exist and can be found with any search engine.

@ericstj
Member

ericstj commented Jun 24, 2016

I've already ported the C# impl for internal use in CLI: https://github.com/dotnet/cli/tree/rel/1.0.0/src/Microsoft.DotNet.Archive/LZMA

The work to do to pull this in is the following:

  1. Better API design to follow wrapping stream design used by DeflateStream
  2. Perf

For perf we could do a similar thing to what was done for ZLIB where we just pick up the native impl and add that to CLRCompression.
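For readers unfamiliar with the wrapping-stream design mentioned in point 1, this is the pattern DeflateStream already follows: the compression stream wraps whichever stream it writes to or reads from. A minimal round trip with the existing API looks like this (a hypothetical LZMAStream would presumably expose the same shape):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

byte[] input = Encoding.UTF8.GetBytes("hello hello hello hello");

// Compress: the DeflateStream wraps the destination stream.
byte[] compressed;
using (var dest = new MemoryStream())
{
    using (var deflate = new DeflateStream(dest, CompressionMode.Compress, leaveOpen: true))
    {
        deflate.Write(input, 0, input.Length);
    }
    compressed = dest.ToArray();
}

// Decompress: the DeflateStream wraps the source stream instead.
byte[] roundTripped;
using (var source = new MemoryStream(compressed))
using (var inflate = new DeflateStream(source, CompressionMode.Decompress))
using (var result = new MemoryStream())
{
    inflate.CopyTo(result);
    roundTripped = result.ToArray();
}

Console.WriteLine(Encoding.UTF8.GetString(roundTripped) == "hello hello hello hello"); // True
```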

@GSPP if you have requests for other algos please file a separate issue.

@ianhays
Contributor

ianhays commented Jun 27, 2016

Something we should consider before we start working on adding support for Tar/Deflate/7z/LZMA/LZMA2/LZ4/BZ2/etc is how we want to design the APIs for the classes to keep them relatively similar without cluttering up Sys.IO.Compression. Options there off the top of my head:

  • Our current design is to give each compression method its own class (DeflateStream/GZipStream). The most consistent option would be to continue this trend for new algorithms e.g. LZMAStream/ZLibStream/BZ2Stream
  • Have a CompressionStream that takes an optional CompressionType enum (Deflate/Zlib/LZMA/etc) as a constructor param
  • Some blend of the above two.

I imagine we'll want to go with the first option so we can provide fine-grained control of the algorithm variables, but it's worth considering nonetheless.

@ericstj
Member

ericstj commented Jun 27, 2016

First option is what I was thinking. I think we can look at higher-level, simple-to-use archive APIs that allow selecting the compression algorithm for an archive, where the caller doesn't necessarily need to know about the specific algorithm type or parameters, but we'd also want to support fine-grained composition of these things to allow for maximum flexibility/tuning. Also, for reading, APIs should be able to "sniff" a data stream to determine what it is (if the format allows for that deterministically) without forcing the user to tell us via type instantiation.

@ianhays
Contributor

ianhays commented Jun 27, 2016

I think we can look at higher-level, simple-to-use archive APIs that allow selecting the compression algorithm for an archive, where the caller doesn't necessarily need to know about the specific algorithm type or parameters

I think this would be the happiest compromise between full functionality and ease of use. I'm picturing a bunch of algorithm-specific streams (DeflateStream/GZipStream/LZMAStream) as well as one high-level stream (CompressionStream) that has simple write/read functionality using the defaults of the chosen compression type, e.g.

    public enum CompressionType
    {
        Deflate = 0,
        GZip = 1,
        ZLib = 2,
        LZMA = 3
    }
    public partial class CompressionStream : System.IO.Stream
    {
        public CompressionStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode) { } // default Deflate for CompressionMode.Compress. CompressionMode.Decompress attempts to detect type from header and defaults to Deflate if it can not be determined.
        public CompressionStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode, System.IO.Compression.CompressionType type) { } // no auto-header detection for Decompression.
        public CompressionStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode, bool leaveOpen) { }
        public CompressionStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode, System.IO.Compression.CompressionType type, bool leaveOpen) { }
        ...
    }

Also for reading APIs should be able to "sniff" a data stream to determine what it is (if the format allows for that deterministically) without forcing the user to tell us via type instantiation.

That would be ideal. The algorithm-specific APIs (e.g. DeflateStream) should require that the given stream adhere to the format, but the higher-level API should attempt to detect the header type and fall back to a default (probably Deflate) if it can't be determined.
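The "sniffing" being discussed could work off magic bytes. Here's a hypothetical sketch (the enum and method names are illustrative only, not a proposed API surface): gzip streams start with 0x1F 0x8B, and a zlib header satisfies the RFC 1950 check that (CMF*256 + FLG) is divisible by 31, while raw deflate has no magic bytes at all, which is why a default is needed.

```csharp
using System;
using System.IO;

// Demo: 0x78 0x9C is a common zlib header (deflate, default compression).
var zlibLike = new MemoryStream(new byte[] { 0x78, 0x9C, 0x00 });
Console.WriteLine(FormatSniffer.Sniff(zlibLike)); // ZLib

// Hypothetical names for illustration only.
enum DetectedFormat { GZip, ZLib, Unknown }

static class FormatSniffer
{
    // Peeks at the first two bytes without consuming them (needs a seekable stream).
    public static DetectedFormat Sniff(Stream stream)
    {
        long start = stream.Position;
        int b0 = stream.ReadByte();
        int b1 = stream.ReadByte();
        stream.Position = start;

        if (b0 == 0x1F && b1 == 0x8B)
            return DetectedFormat.GZip;                        // gzip magic bytes
        if (b0 == 0x78 && b1 >= 0 && ((b0 << 8) | b1) % 31 == 0)
            return DetectedFormat.ZLib;                        // RFC 1950 header check
        // Raw deflate has no magic bytes, so it falls through to Unknown,
        // which is where the "default to Deflate" behavior would kick in.
        return DetectedFormat.Unknown;
    }
}
```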

@ericstj
Member

ericstj commented Jun 27, 2016

Seems reasonable so long as it doesn't hurt perf (e.g. cause an additional buffer copy). I'd say that's a separate feature, but definitely one worth looking at.

@ianhays
Contributor

ianhays commented Jun 27, 2016

I'd say that's a separate feature

Agreed. I opened https://github.com/dotnet/corefx/issues/9709 to track the discussion on that.

@GSPP

GSPP commented Jun 28, 2016

I think the enum approach suffers from non-extensibility. Wouldn't it be better to use one derived type for each algorithm? Algorithms have different parameters and sometimes different capabilities. For example, Zip has the capability to add named entries (files). Many compression libraries offer a ZipStream that behaves like a normal stream plus the ability to define files.

Proposal: Make CompressionStream an abstract base class that anyone, including user code, can derive from. That way there is standardized infrastructure for the compression mode and for leaving the stream open. Also, maybe we can have standardized compression levels 1-9 that each derived algorithm interprets to choose its own settings.

Also, I do not see in what way a class-based approach (as opposed to an enum-based approach) would be inferior(?). Seems equal or better in every regard (apart from format detection).

I advise against format detection when decompressing. The format should be statically known and usually is. Some formats are headerless which breaks the scheme.

@ericstj
Member

ericstj commented Jun 28, 2016

I think the enum approach suffers from non-extensibility

Good point.

I don't see how CompressionStream really solves this. We already have the Stream abstraction, and all the things we're talking about controlling are set at construction, which isn't part of the abstraction. The only things we might expose on the stream are getters for construction parameters, but I don't know if that provides enough value to justify another type in the hierarchy.

class-based approach (as opposed to an enum-based approach) would be inferior(?)

I don't think we disagree. @ianhays was talking about some convenience construction pattern that would use the public Stream derived types behind the scenes. One option for the construction pattern would be enums, but as you mention these don't version well. Perhaps we have a type CompressionParameters that has levels like you suggest and compression Stream implementations can translate those to appropriate algo-specific params.
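The CompressionParameters idea could look something like this sketch. Everything here is hypothetical: the type name, the 1-9 scale, and the dictionary-size mapping are all illustrative, not an actual proposal.

```csharp
using System;

// Demo: level 9 maps to a 32 MB dictionary in this illustrative scheme.
Console.WriteLine(LzmaTuning.DictionarySizeFor(new CompressionParameters(9))); // 33554432

// Hypothetical shared parameter type; each algorithm-specific Stream would
// translate the common level into its own knobs.
public sealed class CompressionParameters
{
    public CompressionParameters(int level)
    {
        if (level < 1 || level > 9)
            throw new ArgumentOutOfRangeException(nameof(level), "level must be 1-9");
        Level = level;
    }

    public int Level { get; }
}

// Illustrative translation for a hypothetical LZMA stream: map the shared
// level onto a dictionary size, 128 KB at level 1 up to 32 MB at level 9.
public static class LzmaTuning
{
    public static int DictionarySizeFor(CompressionParameters p) => 1 << (16 + p.Level);
}
```

This keeps the versioning concern in mind: new algorithms just interpret the shared levels, and nothing like a central enum needs to change.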

Let's use https://github.com/dotnet/corefx/issues/9709 to track coming up with a convenience construction pattern and use this issue for describing the new streams we intend to add.

@ianhays
Contributor

ianhays commented Jul 6, 2016

Back to the LZMA discussion:

Not sure how we're going to want to do this since the LZMA SDK includes implementations in C as well as C#. Some options and related thoughts:

  • Include LZMA C code in clrcompression and add a stream wrapper in Sys.IO.Compression like we did for ZLib
    • Isn't NetStandard.Library friendly (https://github.com/dotnet/corefx/issues/6602)
    • Probably won't work on UWP since it imports a ton of stuff from kernel32 that I doubt is supported.
    • Probably faster than managed, but perf testing that will require a lot of setup to get both methods stood up
    • Keeps our code nicely separate from LZMA code and makes upgrading easy
    • Works for LZMA/LZMA2/XZ/7z
  • Add the LZMA C# code to Sys.IO.Compression
    • Platform agnostic
    • Works only for LZMA (though we could probably make it work for the others without too much trouble - would need to investigate further)
    • We would become custodians of this code as it intermingles with our other compression code.
    • Easiest, fastest way to get LZMA support into Sys.IO.Compression
  • Build the C# LZMA as a separate assembly and have Sys.IO.Compression reference it.
    • Same as above, but keeps the LZMA sdk separate from our junk to simplify upgrading/updating in the future.

@GSPP

GSPP commented Jul 7, 2016

Ideally, it would be a native code solution where possible with a managed fallback. How much slower is the C# version? My guess would be 2-3x. C compilers excel at this kind of code.

@ianhays
Contributor

ianhays commented Jul 8, 2016

Alright, I've got native and managed versions working in Sys.IO.Compression on Windows so we can get some early perf results. Properties are set to the defaults for both.

As with ZLib, the lzma native implementation is significantly faster. Wall-clock time for 10 compress/decompress of the files in the Canterbury Corpus:

| CompressionMode | Managed/Native | FileName | Elapsed Time | Output File Size |
| --- | --- | --- | --- | --- |
| Compress | Managed | alice29.txt | 00:00:02.5441260 | 48452 |
| Compress | Native | alice29.txt | 00:00:01.5311828 | 48466 |
| Decompress | Managed | alice29.txt | 00:00:00.1464692 | 152089 |
| Decompress | Native | alice29.txt | 00:00:00.1598902 | 152089 |
| Compress | Managed | asyoulik.txt | 00:00:01.9258908 | 44498 |
| Compress | Native | asyoulik.txt | 00:00:01.4281719 | 44493 |
| Decompress | Managed | asyoulik.txt | 00:00:00.1296718 | 125179 |
| Decompress | Native | asyoulik.txt | 00:00:00.2165359 | 125179 |
| Compress | Managed | cp.html | 00:00:00.4323960 | 7640 |
| Compress | Native | cp.html | 00:00:01.0320704 | 7632 |
| Decompress | Managed | cp.html | 00:00:00.0369613 | 24603 |
| Decompress | Native | cp.html | 00:00:00.1209998 | 24603 |
| Compress | Managed | fields.c | 00:00:00.2833704 | 2990 |
| Compress | Native | fields.c | 00:00:00.9942155 | 2995 |
| Decompress | Managed | fields.c | 00:00:00.0223107 | 11581 |
| Decompress | Native | fields.c | 00:00:00.0944675 | 11581 |
| Compress | Managed | grammar.lsp | 00:00:00.1832102 | 1242 |
| Compress | Native | grammar.lsp | 00:00:00.9598283 | 1242 |
| Decompress | Managed | grammar.lsp | 00:00:00.0161157 | 3721 |
| Decompress | Native | grammar.lsp | 00:00:00.0733605 | 3721 |
| Compress | Managed | kennedy.xls | 00:00:46.0431617 | 50422 |
| Compress | Native | kennedy.xls | 00:00:08.7667059 | 51396 |
| Decompress | Managed | kennedy.xls | 00:00:00.5156042 | 1029744 |
| Decompress | Native | kennedy.xls | 00:00:00.2552746 | 1029744 |
| Compress | Managed | lcet10.txt | 00:00:07.8654577 | 119556 |
| Compress | Native | lcet10.txt | 00:00:02.6410334 | 119527 |
| Decompress | Managed | lcet10.txt | 00:00:00.3143749 | 426754 |
| Decompress | Native | lcet10.txt | 00:00:00.2462124 | 426754 |
| Compress | Managed | plrabn12.txt | 00:00:08.1932977 | 165353 |
| Compress | Native | plrabn12.txt | 00:00:02.7910135 | 165319 |
| Decompress | Managed | plrabn12.txt | 00:00:00.4050662 | 481861 |
| Decompress | Native | plrabn12.txt | 00:00:00.2855161 | 481861 |
| Compress | Managed | ptt5 | 00:00:04.0817633 | 43788 |
| Compress | Native | ptt5 | 00:00:01.9063002 | 43503 |
| Decompress | Managed | ptt5 | 00:00:00.1783090 | 513216 |
| Decompress | Native | ptt5 | 00:00:00.1627669 | 513216 |
| Compress | Managed | sum | 00:00:00.6918768 | 9427 |
| Compress | Native | sum | 00:00:01.1139660 | 9430 |
| Decompress | Managed | sum | 00:00:00.0472818 | 38240 |
| Decompress | Native | sum | 00:00:00.1067857 | 38240 |
| Compress | Managed | xargs.1 | 00:00:00.1904065 | 1761 |
| Compress | Native | xargs.1 | 00:00:00.9939127 | 1760 |
| Decompress | Managed | xargs.1 | 00:00:00.0208210 | 4227 |
| Decompress | Native | xargs.1 | 00:00:00.0815763 | 4227 |

Note that the native LZMA here is doing its own file IO, so it is going to be a little faster than the end product would be, since we'd want to do the IO in C# so LZMA could follow the Stream pattern.

@GSPP

GSPP commented Jul 8, 2016

I wonder why native and managed have such different results. kennedy.xls is smaller with managed, and kennedy.xls has a crazy managed compression time. Weird.

@ianhays
Contributor

ianhays commented Jul 8, 2016

I expect the implementations are quite different.

I also realize now that the above results are using the property values that Eric used in his CLI port. I re-ran the tests using the default values and edited the post with those results. Looks like the main difference is that managed results for small files are much faster now.

@ericstj
Member

ericstj commented Jul 8, 2016

Yeah, those parameters were tuned for maximum compression of the CLI payload.

I believe the C & C++ implementations are multi-threaded so that makes a big difference. We could do the work to make the C# implementation multi-threaded and measure the difference.

@sqmgh

sqmgh commented Jul 8, 2016

The results do seem to suffer from different implementations, and that does seem to make the "…the lzma native implementation is significantly faster" conclusion an unfair one.

Unless I have completely lost the ability to read a table, there are a number of results where the managed implementation is much faster than the native one. xargs.1 is an example where the ratio is almost as big as in the kennedy.xls case, but in the other direction; it just impacts the overall numbers less because it's a smaller input file.

It seems that as long as you are only doing decompression, more often than not you are actually better off using the managed implementation.

@ianhays
Contributor

ianhays commented Jul 8, 2016

The results do seem to suffer from different implementations, and it does seem to make the "..the lzma native implementation is significantly faster." conclusion an unfair one.

Agreed. The earlier set of results that brought that conclusion were made using Eric's parameters, not the default parameters. As a result, we weren't comparing apples-to-apples. I'll strikethrough my original comment for clarity to people just coming into this thread.

We could do the work to make the C# implementation multi-threaded and measure the difference.

That would be interesting. I'll look into it. I expect it's a fairly large amount of effort, but for large files it would likely be worthwhile.

@ianhays
Contributor

ianhays commented Jul 8, 2016

The operation we most care about multithreading for perf is match finding; roughly 30% of the time is spent there:

(profiler screenshot)

The native LZMA parallelizes these operations when multi-threading is enabled (which it is by default).

As far as getting C# to do the same, it would require a not unreasonable amount of effort. The C# match finder is essentially a rewrite of the C non-mt match finder, so it would follow that the C# multi-threaded match finder could logically succeed as a rewrite of the C multi-threaded match finder, at least as a baseline.

@GSPP

GSPP commented Jul 9, 2016

Compression should by default be single threaded. The threading architecture is not for a compression stream to decide. It's a useful facility to be able to turn this on, though.

There seem to be two multithreading techniques in LZMA(2): 1. The match finder which apparently does not support a high DOP. In the UI I think the DOP maxes out at 2. You can see this from the way memory usage increases when you increase the DOP. 2. Parallel compression of chunks. The latter one is infinitely scalable but should have an impact on ratio.

@ianhays
Contributor

ianhays commented Jul 13, 2016

Compression should by default be single threaded. The threading architecture is not for a compression stream to decide. It's a useful facility to be able to turn this on, though.

That may be so, but the native LZMA is deciding: it is using 2 threads for match finding, at least on my machine in my tests. It would be useful to test it single-threaded to compare to C# on more even ground; I'll try to find some time to do that.

There seem to be two multithreading techniques in LZMA(2): 1. The match finder which apparently does not support a high DOP. In the UI I think the DOP maxes out at 2. You can see this from the way memory usage increases when you increase the DOP. 2. Parallel compression of chunks. The latter one is infinitely scalable but should have an impact on ratio.

According to my tests, making match finding multi-threaded would give us the biggest bang for our buck in bridging the performance gap in the slower cases, so I'd prioritize that first.

@milkshakeuk

milkshakeuk commented Aug 9, 2016

@ianhays Hi, I was just wondering about the state of this issue. Are there any pre-release/alpha/beta implementations that could be tested now?

@ianhays
Contributor

ianhays commented Aug 9, 2016

Hi, I was just wondering about the state of this issue. Are there any pre-release/alpha/beta implementations that could be tested now?

Not quite. I had to switch off to work on some other things, but I'm hoping to get back to compression for 1.1 or 1.2. You're more than welcome to take a stab at it if you'd like :)

@ianhays
Contributor

ianhays commented Aug 25, 2016

Probably the best way to tackle this would be to add the C# code Eric linked for LZMA to System.IO.Compression and modify the API to be stream-based and similar to DeflateStream. The default values should also be changed from what is in Eric's code. That would be the fastest and simplest way to get LZMA into System.IO.Compression and the initial benchmarks show it being comparable in performance to the more complicated (and difficult to deploy to other platforms) native implementation.

If anyone wants to play around with the perf, I've got the code for both managed and native implementations ported already in my fork.

An official API suggestion would also be helpful to get things moving.

@ianhays ianhays self-assigned this Nov 22, 2016
@karelz
Member

karelz commented Jan 30, 2017

@ianhays are you working on it? (it's assigned to you) Or is it up for grabs?

@ianhays ianhays removed their assignment Jan 30, 2017
@jjxtra

jjxtra commented Apr 9, 2021

LZ4 would be nice too; there's already a NuGet package in pure C# with friendly licensing: https://github.com/MiloszKrajewski/K4os.Compression.LZ4.

ZStandard is also interesting: https://github.com/bp74/Zstandard.Net, but it is not in pure C# yet.

@2ji3150

2ji3150 commented May 24, 2023

I saw that Windows 11 is going to support rar/7zip/etc.
Hope that .NET does so too.

@MrM40

MrM40 commented May 24, 2023

Really? That would be a surprise if W11 added native support for those formats. I think rar is a commercial, non-open-source format. I guess 7zip is open source, but it would still be a surprise if MS implemented it natively. What happens when Igor Pavlov makes changes to 7zip? I don't think MS will go down that rabbit hole. Just thinking out loud.
Do you have a link?

@2ji3150

2ji3150 commented May 24, 2023

Do you have a link?

https://www.neowin.net/news/opening-rar-files-natively-in-windows-11-is-coming-and-people-online-are-going-crazy-over-it/
Yep. Seems MSFT is going to use libarchive.

@ViktorHofer ViktorHofer modified the milestones: 8.0.0, Future Jun 7, 2023
@Symbai

Symbai commented Sep 28, 2023

The Windows 11 update which supports RAR and 7zip has been released. How/when can we use that in .NET?

@RokeJulianLockhart

RokeJulianLockhart commented Sep 28, 2023


@Symbai, #92763 (comment) may be relevant here.

@KieranDevvs

Seeing as .NET 8 has just been released and planning for the next version is taking place, is there any chance we could see this prioritised for .NET 9?

@0xced
Contributor

0xced commented Apr 19, 2024

Note

I'm a few weeks late for this joke but let's post it anyway…

In the meantime I think someone should write a dotnet wrapper around XZ Utils.

I heard version 5.6.1 is pretty fast thanks to @JiaT75 optimisations. 😂

@Mrgaton

Mrgaton commented May 23, 2024

I need this so badly on .NET 8,

because Brotli at its smallest-size setting is too slow and compresses much worse than 7z with LZMA2 at Ultra.
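For what it's worth, the Brotli density/speed trade-off mentioned here is at least tunable today through BrotliStream's CompressionLevel overload (CompressionLevel.SmallestSize is available from .NET 6 onward). A minimal sketch comparing the two extremes on a repetitive payload:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

// Compress a buffer with the requested Brotli quality setting.
static byte[] Compress(byte[] input, CompressionLevel level)
{
    using var dest = new MemoryStream();
    using (var brotli = new BrotliStream(dest, level, leaveOpen: true))
    {
        brotli.Write(input, 0, input.Length);
    }
    return dest.ToArray();
}

byte[] data = Encoding.UTF8.GetBytes(
    string.Concat(Enumerable.Repeat("some fairly repetitive payload ", 100)));

// SmallestSize is the densest (and slowest) setting; Fastest trades ratio for speed.
byte[] smallest = Compress(data, CompressionLevel.SmallestSize);
byte[] fastest = Compress(data, CompressionLevel.Fastest);
Console.WriteLine($"SmallestSize: {smallest.Length} bytes, Fastest: {fastest.Length} bytes");
```

Whether that is dense enough compared with LZMA2 at Ultra is, of course, exactly the open question in this issue.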

@KieranDevvs

@Mrgaton Agreed, there are no good LZMA/LZMA2 libraries, and I've had several instances where it would have been very nice to have one.

@EatonZ
Contributor

EatonZ commented Aug 12, 2024

Hello, I'm guessing the window has passed for this to have a chance at being included in .NET 9 later this year?

@KieranDevvs

Hello, I'm guessing the window has passed for this to have a chance at being included in .NET 9 later this year?

Yeah, there's zero chance of that happening, unfortunately.

A better question would be: how much customer engagement needs to happen before this can be prioritised? I'm not trying to be demanding or arrogant, and I understand each release only has so much capacity and there are other features the .NET team thinks are more worthwhile, but this issue is almost 10 years old now and it's the 4th most thumbed-up issue on the entire repository. We're told that thumbing up and engaging with issues on GitHub is the best way to get issues/features looked at. If that statement is true, shouldn't this issue have been picked up a long time ago?

Can someone from the .NET team help me understand what specifically needs to happen to get this prioritised in the next available release? Is it more thumbs up? Is it more reports of customers' use cases being blocked because LZMA isn't available?

@armoiredu44

http://7-zip.org/sdk.html

bro it's a pain to use, I can't find documentation anywhere

@Mrgaton

Mrgaton commented Sep 28, 2024

http://7-zip.org/sdk.html

bro it's a pain to use, I can't find documentation anywhere

Completely agree. We should add all the newer compression algorithms and make them as easy to use as the existing Brotli and Deflate, but with the possibility of completely configuring the options.
