Add support for 7z/LZMA1/LZMA2 to System.IO.Compression #1542
LZMA is pretty much the state-of-the-art general-purpose compression algorithm. It pushes the Pareto frontier outwards across quite a large range. If there is interest in adding more algorithms, there would also be room for a super-fast one that compresses in the 10-300 MB/s range. Such algorithms exist and can be found using any search engine.
I've already ported the C# impl for internal use in CLI: https://github.com/dotnet/cli/tree/rel/1.0.0/src/Microsoft.DotNet.Archive/LZMA

The work to do to pull this in is the following:
For perf, we could do a similar thing to what was done for ZLIB, where we just pick up the native impl and add that to CLRCompression. @GSPP if you have requests for other algos, please file a separate issue.
Something we should consider before we start working on adding support for Tar/Deflate/7z/LZMA/LZMA2/LZ4/BZ2/etc. is how we want to design the APIs for the classes to keep them relatively similar without cluttering up Sys.IO.Compression. Options there, off the top of my head:
I imagine we'll want to go with the first option so we can provide fine-grained control of the algorithm variables, but it's worth considering nonetheless.
First option is what I was thinking. I think we can also look at higher-level, simple-to-use archive APIs that allow selecting the compression algorithm for an archive without the caller needing to know the specific algo type or params, but we'd also want to support fine-grained composition of these things to allow for maximum flexibility/tuning. Also, the reading APIs should be able to "sniff" a data stream to determine what it is (if the format allows for that deterministically) without forcing the user to tell us via type instantiation.
I think this would be the happiest compromise between full functionality and ease of use. I'm picturing a bunch of algorithm-specific streams (DeflateStream/GZipStream/LZMAStream) as well as one high-level stream (CompressionStream) that has simple write/read functionality using the defaults of the chosen compression type, e.g. something like the sketch below.
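A minimal sketch of that shape; all type names are hypothetical (none of this is an approved API), and the high-level entry point is written as a factory method rather than a full Stream subclass just to keep the sketch short:

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical enum naming the supported algorithms.
public enum CompressionType { Deflate, GZip, Lzma }

public static class CompressionStreamFactory
{
    // Returns an algorithm-specific stream configured with that algorithm's defaults.
    // An LZMA stream does not exist in System.IO.Compression today; it is the type
    // this issue proposes, so it is left as a placeholder here.
    public static Stream Create(Stream stream, CompressionType type, CompressionMode mode) =>
        type switch
        {
            CompressionType.Deflate => new DeflateStream(stream, mode),
            CompressionType.GZip => new GZipStream(stream, mode),
            CompressionType.Lzma => throw new NotImplementedException("LZMAStream placeholder"),
            _ => throw new ArgumentOutOfRangeException(nameof(type))
        };
}

// Usage: compress with sensible defaults, without knowing algorithm specifics.
// using Stream s = CompressionStreamFactory.Create(output, CompressionType.GZip, CompressionMode.Compress);
```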
That would be ideal. The algorithm-specific APIs (e.g. DeflateStream) should require that the given stream adhere to the format, but the higher-level API should attempt to detect the header type and default (probably to Deflate) if it can't be determined.
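A sketch of what that sniffing could look like, assuming a seekable stream so the header bytes can be rewound; the magic numbers are the published signatures for gzip, 7z, and xz, and the format list is illustrative only:

```csharp
using System;
using System.IO;

public static class FormatSniffer
{
    public static string Detect(Stream stream)
    {
        long start = stream.Position;
        Span<byte> header = stackalloc byte[6];
        int read = stream.Read(header);
        stream.Position = start; // rewind so the real decoder sees the full data

        if (read >= 2 && header[0] == 0x1F && header[1] == 0x8B)
            return "gzip";
        if (read >= 6 && header[0] == 0x37 && header[1] == 0x7A && header[2] == 0xBC &&
            header[3] == 0xAF && header[4] == 0x27 && header[5] == 0x1C)
            return "7z";
        if (read >= 6 && header[0] == 0xFD && header[1] == 0x37 && header[2] == 0x7A &&
            header[3] == 0x58 && header[4] == 0x5A && header[5] == 0x00)
            return "xz";

        // Raw deflate (and headerless formats like .lzma) have no reliable
        // signature, hence the "default to Deflate" fallback discussed above.
        return "deflate";
    }
}
```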
Seems reasonable so long as it doesn't hurt perf (e.g. cause an additional buffer copy). I'd say that's a separate feature, but definitely one worth looking at.
Agreed. I opened https://github.com/dotnet/corefx/issues/9709 to track the discussion on that.
I think the enum approach suffers from non-extensibility. Wouldn't it be better to use one derived type for each algorithm? Algorithms have different parameters and sometimes different capabilities. For example, Zip has the capability to add named entries (files). Many compression libraries offer a ZipStream that behaves like a normal stream plus the ability to define files.

Proposal: make each algorithm its own stream type with its own parameters (see the sketch below).

Also, I do not see in what way a class-based approach (as opposed to an enum-based approach) would be inferior(?). It seems equal or better in every regard (apart from format detection).

I advise against format detection when decompressing. The format should be statically known, and usually is. Some formats are headerless, which breaks the scheme.
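For illustration, a class-per-algorithm shape might carry algorithm-specific knobs like this; every name here is hypothetical, with the LZMA properties mirroring the lc/lp/pb/dictionary-size parameters from the LZMA SDK:

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical options bag: LZMA exposes tuning knobs that Deflate simply
// does not have, which is the argument for one derived type per algorithm.
public sealed class LzmaCompressionOptions
{
    public int DictionarySize { get; set; } = 1 << 24; // 16 MB
    public int LiteralContextBits { get; set; } = 3;   // lc
    public int LiteralPositionBits { get; set; } = 0;  // lp
    public int PositionBits { get; set; } = 2;         // pb
}

// The corresponding stream would take the options in its constructor,
// analogous to DeflateStream's shape:
//   public LzmaStream(Stream stream, CompressionMode mode, LzmaCompressionOptions options)
```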
Good point. I don't see how …
I don't think we disagree. @ianhays was talking about some convenience construction pattern that would use the public types.

Let's use https://github.com/dotnet/corefx/issues/9709 to track coming up with a convenience construction pattern, and use this issue for describing the new streams we intend to add.
Back to the LZMA discussion: not sure how we're going to want to do this, since the LZMA SDK includes implementations in both C and C#. Some options and related thoughts:
Ideally, it would be a native code solution where possible, with a managed fallback. How much slower is the C# version? My guess would be 2-3x; C compilers excel at this kind of code.
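For the native path, a minimal P/Invoke sketch. This assumes the LZMA SDK's LzmaLib were compiled into a native library (called "clrcompression" here purely for illustration; its inclusion is an assumption, not current fact), with the signatures below mirroring LzmaCompress/LzmaUncompress from the SDK's LzmaLib.h:

```csharp
using System;
using System.Runtime.InteropServices;

internal static class LzmaNative
{
    // One-shot buffer-to-buffer compression, per LzmaLib.h; size_t maps to UIntPtr.
    [DllImport("clrcompression")]
    internal static extern int LzmaCompress(
        byte[] dest, ref UIntPtr destLen,
        byte[] src, UIntPtr srcLen,
        byte[] outProps, ref UIntPtr outPropsSize,
        int level, uint dictSize, int lc, int lp, int pb, int fb, int numThreads);

    // One-shot decompression; props is the 5-byte properties block LzmaCompress produced.
    [DllImport("clrcompression")]
    internal static extern int LzmaUncompress(
        byte[] dest, ref UIntPtr destLen,
        byte[] src, ref UIntPtr srcLen,
        byte[] props, UIntPtr propsSize);
}
```

A managed fallback would then be selected at runtime on platforms where the native library fails to load.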
Alright, I've got native and managed versions working in Sys.IO.Compression on Windows, so we can get some early perf results. Properties are set to the defaults for both.
Note that the native LZMA here is doing file IO itself, so it is going to be a little faster than the end product would be, since we'd want to do the IO in C# so LZMA could follow the Stream pattern.
I wonder why native and managed have such different results.
I expect the implementations are quite different. I also realize now that the above results used the property values that Eric used in his CLI port. I re-ran the tests using the default values and edited the post with those results. It looks like the main difference is that the managed results for small files are much faster now.
Yeah, those parameters were tuned for maximum compression of the CLI payload. I believe the C and C++ implementations are multi-threaded, so that makes a big difference. We could do the work to make the C# implementation multi-threaded and measure the difference.
The results do seem to suffer from differing implementations, and that does seem to make the "...the lzma native implementation is significantly faster." conclusion an unfair one. Unless I have completely lost the ability to read a table, there are a number of results where the managed implementation is much faster than the native one; xargs.x1 is an example where the ratio is almost as big as in the kennedy.xls case, but in the other direction. It just impacts the overall numbers less because it's a smaller input file. It seems that as long as you are only doing decompression, more often than not you are actually better off using the managed implementation.
Agreed. The earlier set of results that led to that conclusion were made using Eric's parameters, not the default parameters, so we weren't comparing apples to apples. I'll strike through my original comment for clarity for people just coming into this thread.
That would be interesting. I'll look into it. I expect it's a fairly large amount of effort, but for large files it would likely be worthwhile.
The multi-threaded operation we care about for perf is match finding; roughly 30% of the time is spent there. The native LZMA parallelizes these operations when multi-threading is enabled (which it is by default). As for getting C# to do the same, it would require a not-unreasonable amount of effort. The C# match finder is essentially a rewrite of the C non-MT match finder, so it follows that a C# multi-threaded match finder could succeed as a rewrite of the C multi-threaded match finder, at least as a baseline.
Compression should be single-threaded by default; the threading architecture is not for a compression stream to decide. It's a useful facility to be able to turn on, though. There seem to be two multithreading techniques in LZMA(2):

1. The match finder, which apparently does not support a high DOP. In the UI, I think the DOP maxes out at 2; you can see this from the way memory usage increases when you increase the DOP.
2. Parallel compression of chunks.

The latter is infinitely scalable but should have an impact on ratio (see the sketch below).
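A rough sketch of the second technique, using DeflateStream as a stand-in since no LZMA stream exists in the BCL. Each chunk starts from fresh compressor state, which is exactly why ratio suffers: matches can never cross a chunk boundary.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class ParallelChunkCompressor
{
    public static byte[][] Compress(byte[] input, int chunkSize = 1 << 20)
    {
        int chunkCount = (input.Length + chunkSize - 1) / chunkSize;
        var compressed = new byte[chunkCount][];

        // Chunks are independent, so throughput scales with core count.
        Parallel.For(0, chunkCount, i =>
        {
            int offset = i * chunkSize;
            int length = Math.Min(chunkSize, input.Length - offset);

            using var buffer = new MemoryStream();
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(input, offset, length);
            }
            compressed[i] = buffer.ToArray();
        });

        // A real container format would need to frame the chunk lengths.
        return compressed;
    }
}
```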
That may be so, but the native LZMA is deciding: it is using 2 threads for match finding, at least on my machine in my tests. It would be useful to test it single-threaded to compare to C# on more even ground; I'll try to find some time to do that.
According to my tests, multi-threading the match finder would give us the biggest bang for our buck in bridging the performance gap in the slower cases, so I'd prioritize that first.
@ianhays Hi, I was just wondering about the state of this issue. Are there any pre-release/alpha/beta implementations that could be tested now?
Not quite. I had to switch off to work on some other things, but I'm hoping to get back to compression for 1.1 or 1.2. You're more than welcome to take a stab at it if you'd like :)
Probably the best way to tackle this would be to add the C# code Eric linked for LZMA to System.IO.Compression and modify the API to be stream-based and similar to DeflateStream. The default values should also be changed from what is in Eric's code. That would be the fastest and simplest way to get LZMA into System.IO.Compression, and the initial benchmarks show it being comparable in performance to the more complicated (and harder to deploy to other platforms) native implementation. If anyone wants to play around with the perf, I've got the code for both managed and native implementations ported already in my fork. An official API suggestion would also be helpful to get things moving; a strawman shape is sketched below.
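As a starting point, a strawman API mirroring DeflateStream's surface, written in the usual ref-assembly style of API proposals (names and defaults are placeholders, not a reviewed design):

```csharp
namespace System.IO.Compression
{
    public partial class LZMAStream : Stream
    {
        public LZMAStream(Stream stream, CompressionLevel compressionLevel);
        public LZMAStream(Stream stream, CompressionLevel compressionLevel, bool leaveOpen);
        public LZMAStream(Stream stream, CompressionMode mode);
        public LZMAStream(Stream stream, CompressionMode mode, bool leaveOpen);
        public Stream BaseStream { get; }
        // Read/Write/Flush overrides omitted; same shape as DeflateStream.
    }
}
```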
@ianhays are you working on it? (it's assigned to you) Or is it up for grabs?
LZ4 would be nice too; there's already a NuGet package in pure C# with friendly licensing: https://github.com/MiloszKrajewski/K4os.Compression.LZ4. ZStandard is also interesting: https://github.com/bp74/Zstandard.Net, but it is not in pure C# yet.
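For reference, the K4os package's simplest entry point looks roughly like this (API as described in its README; worth verifying against the current package version):

```csharp
using K4os.Compression.LZ4;

byte[] input = System.Text.Encoding.UTF8.GetBytes("hello hello hello hello");
byte[] packed = LZ4Pickler.Pickle(input);      // self-contained LZ4 blob
byte[] restored = LZ4Pickler.Unpickle(packed); // round-trips the original bytes
```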
I saw that Windows 11 is going to support RAR / 7zip / etc.
Really? That would be a surprise if W11 added native support for those formats. I think RAR is a commercial, non-open-source format. I guess 7zip is open source, but it will still be a surprise if MS implements it natively. What happens when Igor Pavlov makes changes to 7zip? I don't think MS will go down that rabbit hole. Just thinking out loud.
https://www.neowin.net/news/opening-rar-files-natively-in-windows-11-is-coming-and-people-online-are-going-crazy-over-it/
The Windows 11 update which supports RAR and 7zip has been released. How/when can we use that in .NET?
@Symbai, #92763 (comment) may be relevant here.
Seeing as .NET 8 has just been released and planning for the next version is taking place, is there any chance we could see this prioritised for .NET 9?
I badly need this on .NET 8, because Brotli at its smallest-size setting is too slow and compresses much worse than 7z with LZMA2 on Ultra.
@Mrgaton Agreed, there are no good LZMA/LZMA2 libraries, and I've had several instances where it would have been very nice to have one.
Hello, I'm guessing the window has passed for this to have a chance at being included in .NET 9 later this year?
Yeah, there's zero chance of that happening, unfortunately. A better question would be: how much customer engagement needs to happen before this can be prioritised?

I'm not trying to be demanding or arrogant, and I understand each release only has so much capacity and there are other features the .NET team thinks are more worthwhile. But this issue is almost 10 years old now, and it's the 4th most thumbed-up issue on the entire repository. We're told that thumbing up and engaging with issues on GitHub is the best way to get issues/features looked at. If that statement is true, shouldn't this issue have been picked up a long time ago?

Can someone from the .NET team help me understand what specifically needs to happen to get this prioritised in the next available release? Is it more thumbs up? Is it more reports of customer use cases being blocked because LZMA isn't available?
bro it's a pain to use, I can't find documentation anywhere
Completely agree; we should add all the newer compression algorithms and make them easy to use like the existing Brotli and Deflate, but with the possibility of completely configuring the options.
http://7-zip.org/sdk.html