Add support for 7z/LZMA1/LZMA2 to System.IO.Compression #9657

Open
joshfree opened this Issue Jun 24, 2016 · 37 comments

@joshfree
Member

joshfree commented Jun 24, 2016

@GSPP

GSPP commented Jun 24, 2016

LZMA is pretty much the state-of-the-art general-purpose compression algorithm. It pushes the Pareto frontier outwards across quite a large range.

If there is interest in adding more algorithms, there would also be room for a super-fast one that compresses in the 10-300 MB/s range. Such algorithms exist and can be found with any search engine.

@ericstj

Member

ericstj commented Jun 24, 2016

I've already ported the C# implementation for internal use in the CLI: https://github.com/dotnet/cli/tree/rel/1.0.0/src/Microsoft.DotNet.Archive/LZMA

The work needed to pull this in:

  1. Better API design, following the wrapping-stream design used by DeflateStream
  2. Perf

For perf we could do something similar to what was done for ZLib: pick up the native implementation and add it to CLRCompression.

@GSPP if you have requests for other algorithms, please file a separate issue.
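For context, the "wrapping stream" design mentioned in item 1 is the pattern DeflateStream already follows: the compression stream wraps another stream and the caller owns the IO. A minimal round-trip sketch against the existing DeflateStream API (a future LZMAStream would presumably be used the same way):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class WrappingStreamDemo
{
    static void Main()
    {
        byte[] input = Encoding.UTF8.GetBytes(new string('a', 10_000));

        // Compress: the compression stream wraps the *destination* stream.
        var compressed = new MemoryStream();
        using (var deflate = new DeflateStream(compressed, CompressionMode.Compress, leaveOpen: true))
        {
            deflate.Write(input, 0, input.Length);
        }

        // Decompress: the compression stream wraps the *source* stream.
        compressed.Position = 0;
        var roundTripped = new MemoryStream();
        using (var inflate = new DeflateStream(compressed, CompressionMode.Decompress))
        {
            inflate.CopyTo(roundTripped);
        }

        Console.WriteLine(roundTripped.Length);              // 10000
        Console.WriteLine(compressed.Length < input.Length); // True
    }
}
```

The `leaveOpen: true` flag matters here: disposing the compression stream flushes the final block, and the destination stream is still needed afterwards.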

@ianhays

Member

ianhays commented Jun 27, 2016

Something we should consider before we start adding support for Tar/Deflate/7z/LZMA/LZMA2/LZ4/BZ2/etc. is how we want to design the APIs for the classes to keep them relatively similar without cluttering up Sys.IO.Compression. Options off the top of my head:

  • Our current design gives each compression method its own class (DeflateStream/GZipStream). The most consistent option would be to continue this trend for new algorithms, e.g. LZMAStream/ZLibStream/BZ2Stream.
  • Have a CompressionStream that takes an optional CompressionType enum (Deflate/Zlib/LZMA/etc.) as a constructor param.
  • Some blend of the above two.

I imagine we'll want to go with the first option so we can provide fine-grained control of the algorithm variables, but it's worth considering nonetheless.

@ericstj

Member

ericstj commented Jun 27, 2016

The first option is what I was thinking. I think we can also look at higher-level, simple-to-use archive APIs that allow selecting the compression algorithm for an archive, where the caller doesn't necessarily need to know about the specific algorithm type or params, but we'd also want to support fine-grained composition of these things to allow for maximum flexibility/tuning. Also, for reading, the APIs should be able to "sniff" a data stream to determine what it is (if the format allows for that deterministically) without forcing the user to tell us via type instantiation.

@ianhays

Member

ianhays commented Jun 27, 2016

I think we can look at higher-level simple to use archive APIs that might allow for selection of the compression algorithm for an archive where the caller doesn't necessarily need to know about the specific algo type or params

I think this would be the happiest compromise between full functionality and ease of use. I'm picturing a bunch of algorithm-specific streams (DeflateStream/GZipStream/LZMAStream) as well as one high-level stream (CompressionStream) that has simple write/read functionality using the defaults of the chosen compression type, e.g.

    public enum CompressionType
    {
        Deflate = 0,
        GZip = 1,
        ZLib = 2,
        LZMA = 3
    }
    public partial class CompressionStream : System.IO.Stream
    {
        // Defaults to Deflate for CompressionMode.Compress. CompressionMode.Decompress attempts to detect the type from the header and defaults to Deflate if it cannot be determined.
        public CompressionStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode) { }
        // No auto-header detection for decompression.
        public CompressionStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode, System.IO.Compression.CompressionType type) { }
        public CompressionStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode, bool leaveOpen) { }
        public CompressionStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode, System.IO.Compression.CompressionType type, bool leaveOpen) { }
        ...
    }

Also for reading APIs should be able to "sniff" a data stream to determine what it is (if the format allows for that deterministically) without forcing the user to tell us via type instantiation.

That would be ideal. The algorithm-specific APIs (e.g. DeflateStream) should require that the given stream adhere to the format, but the higher-level API should attempt to detect the header type and default (probably to Deflate) if it can't be determined.
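For what it's worth, most of the container formats under discussion carry distinguishing magic bytes, so header detection is implementable for them; raw LZMA is the notable headerless exception. A minimal detection sketch (DetectedFormat and Sniff are hypothetical names, not proposed API):

```csharp
using System;

enum DetectedFormat { Unknown, GZip, ZLib, XZ, SevenZip }

static class FormatSniffer
{
    // Hypothetical helper: inspects leading bytes only. Raw LZMA has no
    // magic number, which is exactly why sniffing can't be fully general.
    public static DetectedFormat Sniff(byte[] header)
    {
        if (header.Length >= 6 &&
            header[0] == 0xFD && header[1] == 0x37 && header[2] == 0x7A &&
            header[3] == 0x58 && header[4] == 0x5A && header[5] == 0x00)
            return DetectedFormat.XZ;                  // .xz magic
        if (header.Length >= 6 &&
            header[0] == 0x37 && header[1] == 0x7A && header[2] == 0xBC &&
            header[3] == 0xAF && header[4] == 0x27 && header[5] == 0x1C)
            return DetectedFormat.SevenZip;            // 7z signature
        if (header.Length >= 2 && header[0] == 0x1F && header[1] == 0x8B)
            return DetectedFormat.GZip;                // RFC 1952 magic
        if (header.Length >= 2 && (header[0] & 0x0F) == 8 &&
            ((header[0] << 8) | header[1]) % 31 == 0)
            return DetectedFormat.ZLib;                // RFC 1950 CMF/FLG check
        return DetectedFormat.Unknown;
    }

    static void Main()
    {
        Console.WriteLine(Sniff(new byte[] { 0x1F, 0x8B, 0x08 })); // GZip
        Console.WriteLine(Sniff(new byte[] { 0x78, 0x9C }));       // ZLib
        Console.WriteLine(Sniff(new byte[] { 0x00, 0x00 }));       // Unknown
    }
}
```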

@ericstj

Member

ericstj commented Jun 27, 2016

Seems reasonable so long as it doesn't hurt perf (e.g. cause an additional buffer copy). I'd say that's a separate feature, but definitely one worth looking at.

@ianhays

Member

ianhays commented Jun 27, 2016

I'd say that's a separate feature

Agreed. I opened #9709 to track the discussion on that.

@GSPP

GSPP commented Jun 28, 2016

I think the enum approach suffers from non-extensibility. Wouldn't it be better to use one derived type for each algorithm? Algorithms have different parameters and sometimes different capabilities. For example, Zip has the capability to add named entries (files). Many compression libraries offer a ZipStream that behaves like a normal stream plus the ability to define files.

Proposal: make CompressionStream an abstract base class that anyone, including user code, can derive from. That way there is a standardized infrastructure for the compression mode and for leaving the stream open. Also, maybe we can have standardized compression levels 1-9 that each derived algorithm interprets and uses to set its settings.

Also, I do not see in what way a class-based approach (as opposed to an enum-based approach) would be inferior. It seems equal or better in every regard (apart from format detection).

I advise against format detection when decompressing. The format should be statically known and usually is. Some formats are headerless, which breaks the scheme.
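A rough sketch of what that abstract base could look like, assuming the standardized pieces are the mode, leaveOpen, and a 1-9 level (all names here are illustrative, not an API proposal):

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical abstract base: centralizes mode, leaveOpen, and a
// standardized 1-9 level that derived algorithms map to their own knobs.
abstract class CompressionStreamBase : Stream
{
    protected Stream BaseStream { get; }
    protected CompressionMode Mode { get; }
    protected bool LeaveOpen { get; }
    public int Level { get; } // 1 = fastest .. 9 = best ratio

    protected CompressionStreamBase(Stream stream, CompressionMode mode, int level, bool leaveOpen)
    {
        if (level < 1 || level > 9) throw new ArgumentOutOfRangeException(nameof(level));
        BaseStream = stream ?? throw new ArgumentNullException(nameof(stream));
        Mode = mode;
        Level = level;
        LeaveOpen = leaveOpen;
    }

    // A compression stream is one-directional: readable when decompressing,
    // writable when compressing, never seekable.
    public override bool CanRead => Mode == CompressionMode.Decompress && BaseStream.CanRead;
    public override bool CanWrite => Mode == CompressionMode.Compress && BaseStream.CanWrite;
    public override bool CanSeek => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing && !LeaveOpen) BaseStream.Dispose();
        base.Dispose(disposing);
    }
}
```

Derived types would still add their own algorithm-specific constructor overloads; the base only standardizes the common plumbing (Read/Write/Flush stay abstract from Stream).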

@ericstj

Member

ericstj commented Jun 28, 2016

I think the enum approach suffers from non-extensibility

Good point.

I don't see how CompressionStream really solves this, though. We already have the Stream abstraction, and all the things we're talking about controlling are set at construction, which isn't part of the abstraction. The only things we might expose on the stream are getters for the construction parameters, but I don't know whether that provides enough value to justify another type in the hierarchy.

class-based approach (as opposed to an enum-based approach) would be inferior(?)

I don't think we disagree. @ianhays was talking about a convenience construction pattern that would use the public Stream-derived types behind the scenes. One option for the construction pattern would be enums, but as you mention, those don't version well. Perhaps we have a type CompressionParameters that has levels like you suggest, and compression Stream implementations can translate those to appropriate algorithm-specific params.

Let's use #9709 to track coming up with a convenience construction pattern and use this issue for describing the new streams we intend to add.
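As a thought experiment, the translation idea might look like this. The CompressionParameters type and the specific LZMA mapping below are entirely hypothetical, chosen only to show the shape of the level-to-knob translation:

```csharp
using System;

// Hypothetical CompressionParameters: a standardized 1-9 level that each
// algorithm's stream translates into its own native knobs.
sealed class CompressionParameters
{
    public int Level { get; }
    public CompressionParameters(int level)
    {
        if (level < 1 || level > 9) throw new ArgumentOutOfRangeException(nameof(level));
        Level = level;
    }
}

static class LzmaParameterTranslation
{
    // Illustrative translation only (not the LZMA SDK's actual defaults):
    // higher levels get larger dictionaries and more fast bytes.
    public static (int DictionarySize, int NumFastBytes) Translate(CompressionParameters p) =>
        (DictionarySize: 1 << (15 + p.Level), // level 1 -> 64 KB ... level 9 -> 16 MB
         NumFastBytes: p.Level <= 5 ? 32 : 64);

    static void Main()
    {
        var (dict, fastBytes) = Translate(new CompressionParameters(9));
        Console.WriteLine(dict);      // 16777216
        Console.WriteLine(fastBytes); // 64
    }
}
```

The appeal over an enum is that adding a new algorithm needs no change to any shared type: the new stream simply interprets the same CompressionParameters.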

@ianhays

Member

ianhays commented Jul 6, 2016

Back to the LZMA discussion:

Not sure how we're going to want to do this, since the LZMA SDK includes implementations in both C and C#. Some options and related thoughts:

  • Include the LZMA C code in clrcompression and add a stream wrapper in Sys.IO.Compression, like we did for ZLib
    • Isn't NetStandard.Library friendly (#6602)
    • Probably won't work on UWP since it imports a ton of stuff from kernel32 that I doubt is supported
    • Probably faster than managed, but perf testing that will require a lot of setup to get both methods stood up
    • Keeps our code nicely separate from the LZMA code and makes upgrading easy
    • Works for LZMA/LZMA2/XZ/7z
  • Add the LZMA C# code to Sys.IO.Compression
    • Platform agnostic
    • Works only for LZMA (though we could probably make it work for the others without too much trouble - would need to investigate further)
    • We would become custodians of this code as it intermingles with our other compression code
    • Easiest, fastest way to get LZMA support into Sys.IO.Compression
  • Build the C# LZMA as a separate assembly and have Sys.IO.Compression reference it
    • Same as above, but keeps the LZMA SDK separate from our junk to simplify upgrading/updating in the future
@GSPP

GSPP commented Jul 7, 2016

Ideally, it would be a native code solution where possible with a managed fallback. How much slower is the C# version? My guess would be 2-3x. C compilers excel at this kind of code.

@ianhays

Member

ianhays commented Jul 8, 2016

Alright, I've got native and managed versions working in Sys.IO.Compression on Windows, so we can get some early perf results. Properties are set to the defaults for both.

As with ZLib, the native LZMA implementation is significantly faster. Wall-clock time for 10 compress/decompress iterations over the files in the Canterbury Corpus:

| CompressionMode | Managed/Native | FileName | Elapsed Time | Output File Size |
| --- | --- | --- | --- | --- |
| Compress | Managed | alice29.txt | 00:00:02.5441260 | 48452 |
| Compress | Native | alice29.txt | 00:00:01.5311828 | 48466 |
| Decompress | Managed | alice29.txt | 00:00:00.1464692 | 152089 |
| Decompress | Native | alice29.txt | 00:00:00.1598902 | 152089 |
| Compress | Managed | asyoulik.txt | 00:00:01.9258908 | 44498 |
| Compress | Native | asyoulik.txt | 00:00:01.4281719 | 44493 |
| Decompress | Managed | asyoulik.txt | 00:00:00.1296718 | 125179 |
| Decompress | Native | asyoulik.txt | 00:00:00.2165359 | 125179 |
| Compress | Managed | cp.html | 00:00:00.4323960 | 7640 |
| Compress | Native | cp.html | 00:00:01.0320704 | 7632 |
| Decompress | Managed | cp.html | 00:00:00.0369613 | 24603 |
| Decompress | Native | cp.html | 00:00:00.1209998 | 24603 |
| Compress | Managed | fields.c | 00:00:00.2833704 | 2990 |
| Compress | Native | fields.c | 00:00:00.9942155 | 2995 |
| Decompress | Managed | fields.c | 00:00:00.0223107 | 11581 |
| Decompress | Native | fields.c | 00:00:00.0944675 | 11581 |
| Compress | Managed | grammar.lsp | 00:00:00.1832102 | 1242 |
| Compress | Native | grammar.lsp | 00:00:00.9598283 | 1242 |
| Decompress | Managed | grammar.lsp | 00:00:00.0161157 | 3721 |
| Decompress | Native | grammar.lsp | 00:00:00.0733605 | 3721 |
| Compress | Managed | kennedy.xls | 00:00:46.0431617 | 50422 |
| Compress | Native | kennedy.xls | 00:00:08.7667059 | 51396 |
| Decompress | Managed | kennedy.xls | 00:00:00.5156042 | 1029744 |
| Decompress | Native | kennedy.xls | 00:00:00.2552746 | 1029744 |
| Compress | Managed | lcet10.txt | 00:00:07.8654577 | 119556 |
| Compress | Native | lcet10.txt | 00:00:02.6410334 | 119527 |
| Decompress | Managed | lcet10.txt | 00:00:00.3143749 | 426754 |
| Decompress | Native | lcet10.txt | 00:00:00.2462124 | 426754 |
| Compress | Managed | plrabn12.txt | 00:00:08.1932977 | 165353 |
| Compress | Native | plrabn12.txt | 00:00:02.7910135 | 165319 |
| Decompress | Managed | plrabn12.txt | 00:00:00.4050662 | 481861 |
| Decompress | Native | plrabn12.txt | 00:00:00.2855161 | 481861 |
| Compress | Managed | ptt5 | 00:00:04.0817633 | 43788 |
| Compress | Native | ptt5 | 00:00:01.9063002 | 43503 |
| Decompress | Managed | ptt5 | 00:00:00.1783090 | 513216 |
| Decompress | Native | ptt5 | 00:00:00.1627669 | 513216 |
| Compress | Managed | sum | 00:00:00.6918768 | 9427 |
| Compress | Native | sum | 00:00:01.1139660 | 9430 |
| Decompress | Managed | sum | 00:00:00.0472818 | 38240 |
| Decompress | Native | sum | 00:00:00.1067857 | 38240 |
| Compress | Managed | xargs.1 | 00:00:00.1904065 | 1761 |
| Compress | Native | xargs.1 | 00:00:00.9939127 | 1760 |
| Decompress | Managed | xargs.1 | 00:00:00.0208210 | 4227 |
| Decompress | Native | xargs.1 | 00:00:00.0815763 | 4227 |

Note that the native LZMA here is doing file IO itself, so it is going to be a little faster than the end product would be, since we'd want to do the IO in C# so that LZMA could follow the Stream pattern.
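For anyone wanting to reproduce this kind of measurement, a wall-clock harness along these lines is enough (shown with DeflateStream as a stand-in, since the LZMA streams used here aren't public):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

class CompressionBenchmark
{
    // Times `iterations` compress passes over `input` and reports the
    // wall-clock total plus the compressed size, mirroring the table above.
    public static (TimeSpan Elapsed, long OutputSize) TimeCompress(byte[] input, int iterations)
    {
        long outputSize = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var dest = new MemoryStream();
            using (var s = new DeflateStream(dest, CompressionMode.Compress, leaveOpen: true))
                s.Write(input, 0, input.Length);
            outputSize = dest.Length;
        }
        sw.Stop();
        return (sw.Elapsed, outputSize);
    }

    static void Main()
    {
        byte[] input = new byte[100_000]; // highly compressible: all zeros
        var (elapsed, size) = TimeCompress(input, 10);
        Console.WriteLine(size < input.Length);     // True
        Console.WriteLine(elapsed > TimeSpan.Zero); // True
    }
}
```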

@GSPP

GSPP commented Jul 8, 2016

I wonder why native and managed have such different results. kennedy.xls is smaller with managed, and kennedy.xls has a crazy managed compression time. Weird.

@ianhays

Member

ianhays commented Jul 8, 2016

I expect the implementations are quite different.

I also realized that the above results were originally generated using the property values that Eric used in his CLI port. I re-ran the tests using the default values and edited the post with those results. The main difference is that the managed results for small files are much faster now.

@ericstj

Member

ericstj commented Jul 8, 2016

Yeah, those parameters were tuned for maximum compression of the CLI payload.

I believe the C & C++ implementations are multi-threaded, so that makes a big difference. We could do the work to make the C# implementation multi-threaded and measure the difference.

@sqmgh

sqmgh commented Jul 8, 2016

The results do seem to suffer from the implementations being different, which makes the "...the lzma native implementation is significantly faster" conclusion an unfair one.

Unless I have completely lost the ability to read a table, there are a number of results where the managed implementation is much faster than the native one. xargs.1 is an example where the ratio is almost as big as in the kennedy.xls case, but in the other direction. It just impacts the overall numbers less because it's a smaller input file.

It seems that as long as you are only doing decompression, more often than not you are actually better off using the managed implementation.

@ianhays

Member

ianhays commented Jul 8, 2016

The results do seem to suffer from different implementations, and it does seem to make the "..the lzma native implementation is significantly faster." conclusion an unfair one.

Agreed. The earlier set of results that brought that conclusion were made using Eric's parameters, not the default parameters. As a result, we weren't comparing apples-to-apples. I'll strikethrough my original comment for clarity to people just coming into this thread.

We could do the work to make the C# implementation multi-threaded and measure the difference.

That would be interesting. I'll look into it. I expect it's a fairly large amount of effort, but for large files it would likely be worthwhile.

@ianhays

Member

ianhays commented Jul 8, 2016

The operations we care about multithreading for perf are in match finding - about 30% of the time is spent there:

[profiler screenshot]

The native LZMA parallelizes these operations when multi-threading is enabled (which it is by default).

As far as getting C# to do the same: it would require a not-unreasonable amount of effort. The C# match finder is essentially a rewrite of the C non-multithreaded match finder, so it follows that a C# multi-threaded match finder could start as a rewrite of the C multi-threaded match finder, at least as a baseline.

@GSPP

GSPP commented Jul 9, 2016

Compression should be single-threaded by default. The threading architecture is not for a compression stream to decide. It's a useful facility to be able to turn on, though.

There seem to be two multithreading techniques in LZMA(2):

  1. The match finder, which apparently does not support a high DOP (degree of parallelism) - in the UI, I think the DOP maxes out at 2. You can see this from the way memory usage increases when you increase the DOP.
  2. Parallel compression of chunks. This one scales arbitrarily but should have an impact on ratio.

@ianhays

Member

ianhays commented Jul 13, 2016

Compression should by default be single threaded. The threading architecture is not for a compression stream to decide. It's a useful facility to be able to turn this on, though.

That may be so, but the native LZMA is deciding. It is using 2 threads for match finding, at least on my machine in my tests. It would be useful to test it single-threaded to compare to C# on a more even ground - I'll try to find some time to do that.

There seem to be two multithreading techniques in LZMA(2): 1. The match finder which apparently does not support a high DOP. In the UI I think the DOP maxes out at 2. You can see this from the way memory usage increases when you increase the DOP. 2. Parallel compression of chunks. The latter one is infinitely scalable but should have an impact on ratio.

According to my tests, match finding being multi-threaded would give us the biggest "bang for our buck" so to speak to bridge the performance gap in the slower cases so I'd prioritize that first.

milkshakeuk commented Aug 9, 2016

@ianhays Hi, I was just wondering about the state of this issue - are there any pre-release/alpha/beta implementations that could be tested now?

ianhays commented Aug 9, 2016

> Hi, I was just wondering on the state of this issue? is there any pre-release/alpha/beta implementations that could be tested now?

Not quite. I had to switch off to work on some other things, but I'm hoping to get back to compression for 1.1 or 1.2. You're more than welcome to take a stab at it if you'd like :)

@ianhays ianhays added the up-for-grabs label Aug 9, 2016

ianhays commented Aug 25, 2016

Probably the best way to tackle this would be to add the C# code Eric linked for LZMA to System.IO.Compression and modify the API to be stream-based and similar to DeflateStream. The default values should also be changed from what is in Eric's code. That would be the fastest and simplest way to get LZMA into System.IO.Compression, and the initial benchmarks show it being comparable in performance to the more complicated (and difficult to deploy to other platforms) native implementation.

If anyone wants to play around with the perf, I've got the code for both managed and native implementations ported already in my fork.

An official API suggestion would also be helpful to get things moving.
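As a starting point for such an API suggestion, a ref-assembly-style sketch of a DeflateStream-shaped wrapper might look like the following. This is purely hypothetical - `LzmaStream` and every member below are assumptions modeled on DeflateStream's surface, not an approved API:

```csharp
namespace System.IO.Compression
{
    // Hypothetical API shape only; bodies elided using the corefx
    // "throw null" ref-assembly convention.
    public sealed class LzmaStream : Stream
    {
        public LzmaStream(Stream stream, CompressionMode mode) { }
        public LzmaStream(Stream stream, CompressionMode mode, bool leaveOpen) { }
        public LzmaStream(Stream stream, CompressionLevel compressionLevel) { }
        public LzmaStream(Stream stream, CompressionLevel compressionLevel, bool leaveOpen) { }

        public Stream BaseStream { get { throw null; } }
        public override bool CanRead { get { throw null; } }
        public override bool CanWrite { get { throw null; } }
        public override bool CanSeek { get { throw null; } }
        public override long Length { get { throw null; } }
        public override long Position { get { throw null; } set { } }

        public override int Read(byte[] buffer, int offset, int count) { throw null; }
        public override void Write(byte[] buffer, int offset, int count) { throw null; }
        public override void Flush() { }
        public override long Seek(long offset, SeekOrigin origin) { throw null; }
        public override void SetLength(long value) { }
    }
}
```

Decompression would then read `new LzmaStream(file, CompressionMode.Decompress)` exactly the way DeflateStream is used today.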

@karelz karelz added enhancement and removed Feature labels Sep 18, 2016

@ianhays ianhays added api-needs-work and removed enhancement labels Oct 10, 2016

@ianhays ianhays self-assigned this Nov 22, 2016

karelz commented Jan 30, 2017

@ianhays are you working on it? (it's assigned to you) Or is it up for grabs?

@ianhays ianhays removed their assignment Jan 30, 2017

ianhays commented Jan 30, 2017

I'm not actively working on this, no. Any community members interested in this are welcome to take a stab at it following the instructions from my earlier comment. The most difficult part of adding support will be making the managed LZMA code play nicely with a stream interface like DeflateStream's.

Another route would be to utilize the work done already in one of the many open-source implementations of LZMA like SharpCompress rather than create our own. This is related in spirit to the discussion at #14853, where a Xamarin member suggested modifying ZipArchive to use libzip.

danmosemsft commented May 26, 2017

Anyone else who wants this and would use it, please thumb up the top post to help gauge interest.

adamhathcock commented May 29, 2017

I would be interested in looking at pushing algorithms into the core and exposing more outside to help projects. As the author of SharpCompress, I like having a unified interface.

Another thing: working with non-seekable streams is a big thing, so I can't use the current Zip implementation in the core. I'd like to get native zlib usage though, so that's an example of something I'd like exposed.

The 7Zip format is awful and seekable-only. But implementing LZip for LZMA natively might be good, like GZip is. Probably BZip2 too.

karelz commented May 30, 2017

Sounds interesting. @ianhays what do you think?

ianhays commented May 30, 2017

> I would be interested in looking at pushing algorithms into the core and exposing more outside to help projects. As the author of SharpCompress, i like having a unified interface.

Yay!

> The 7Zip format is awful and seekable only.

7z support is definitely lower on the totem pole than LZMA. It'd be nice to get it 'for free', but it is otherwise not worth a lot of additional effort until we get other things in place.

> Probably BZip2 too.

I recall chatting about bzip2 somewhere else... It's something we should eventually support.

> I'd like to get native zlib usage though so that's an example of something I'd like exposed.

I'm not sure we'd want to publicly expose the native zlib functions - that's a bit lower level than the majority of people would need. You could always P/Invoke directly into 'clrcompression' yourself to get at deflateInit2, inflate, inflateEnd, etc., though it's technically unsupported.

> Another thing, working with non-seekable streams is a big thing so i cant use the current Zip implementation in the core.

Specifically for Update mode? Or are you referring to potentially large streams that you do not want to copy entirely into memory when reading? Because I dislike the latter and would love to see a solution that didn't require it.

I think I addressed all of your comments, @adamhathcock... there's a lot of meat in there :)
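The "technically unsupported" P/Invoke route might look like the sketch below. Everything here is an assumption: the export names follow stock zlib (`deflateInit2_` is the real export behind the `deflateInit2` macro), the library name assumes the Windows clrcompression binary, and the `ZStream` layout must be verified field-by-field against the zlib build clrcompression actually carries before use - a mismatched layout corrupts memory.

```csharp
using System;
using System.Runtime.InteropServices;

// Unsupported sketch: binding directly to zlib exports inside clrcompression.
internal static class ZLibNative
{
    // Mirrors zlib's z_stream. uLong fields are pointer-sized on 64-bit
    // non-Windows zlib builds; the widths below assume a Windows-style build.
    [StructLayout(LayoutKind.Sequential)]
    internal struct ZStream
    {
        public IntPtr NextIn;
        public uint AvailIn;
        public uint TotalIn;
        public IntPtr NextOut;
        public uint AvailOut;
        public uint TotalOut;
        public IntPtr Msg;
        public IntPtr InternalState;
        public IntPtr ZAlloc;
        public IntPtr ZFree;
        public IntPtr Opaque;
        public int DataType;
        public uint Adler;
        public uint Reserved;
    }

    [DllImport("clrcompression.dll", EntryPoint = "deflateInit2_")]
    public static extern int DeflateInit2(ref ZStream strm, int level, int method,
        int windowBits, int memLevel, int strategy,
        [MarshalAs(UnmanagedType.LPStr)] string version, int streamSize);

    [DllImport("clrcompression.dll", EntryPoint = "inflate")]
    public static extern int Inflate(ref ZStream strm, int flush);

    [DllImport("clrcompression.dll", EntryPoint = "inflateEnd")]
    public static extern int InflateEnd(ref ZStream strm);
}
```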

adamhathcock commented May 30, 2017

I'm referring to dealing with large files, network streams, or any other scenario where a stream is (or should be, for perf) accessed in a forward-only manner. I have a reader/writer interface that allows for that.

If my code or something new was in the core, then there wouldn't be a need to expose zlib directly for the above scenario.

There is a lot of meat in this stuff :) It would be cool if SharpCompress or a lot of the ideas were integrated.

I guess I should start learning how to do a local build of corefx.

ianhays commented May 30, 2017

> I'm referring to dealing with large files, network streams or any other scenario where a stream is (or should be for perf) accessed in a forward-only matter. I have a reader/writer interface that allows for that.

I've been hearing requests for avoiding the MemoryStream copying for a while now. The big questions are how minimal the API change could be and exactly how large the performance benefit is.

> There is a lot of meat in this stuff :) It would be cool if SharpCompress or a lot of the ideas were integrated.

I absolutely agree! There's the ever-present question of what 'qualifies' to be in CoreFX and what should be relegated to a secondary opt-in library, and that needs to be answered for any changes submitted, but I personally think System.IO.Compression has a long way to go and should have significantly wider algorithm support in-box to more readily support cross-platform scenarios. It would be good if we can chunk up the additions into as small bits as possible to squeeze them through review more easily.

> I guess I should start learning how to do a local build of corefx.

We've got some docs for getting started that we've been actively working on improving lately. Feel free to comment here or shoot me an email if you have any issues or suggestions :)

cc: This conversation shares some characteristics with the one in #14853 - potentially basing the ZipArchive implementation around libzip.

adamhathcock commented May 31, 2017

> I've been hearing requests for avoiding the memorystream copying for a while now. The big question is to how minimal the API code change could be and exactly how large the performance benefit is.

From my experience, you can't do this with a random-access API like ZipArchive is. In SharpCompress, there are two APIs: one for random access (Archives) and one for forward-only (Reader/Writer). With Reader/Writers, you basically iterate a foreach loop and deal with each file. If an entry isn't read, then it's skipped (if the size is known) or decompressed to null (if the size is unknown), depending on the archive format.

Maybe a new issue should be created around forward-only access. That is separate from #14853, which is about 3rd-party and/or native usage of the zip format. Unfortunately, most implementations assume you have random access, so reimplementing might be required for forward-only.

Putting more compressors/formats (like LZMA, LZip, etc.) in is another issue. I'd like to have enough bits exposed to be able to do forward-only access in a 3rd-party lib if that's the judgement.
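The forward-only Reader pattern described above looks roughly like this in SharpCompress (method and namespace names are taken from recent SharpCompress releases; double-check them against the version you install):

```csharp
using System.IO;
using SharpCompress.Readers; // from the SharpCompress NuGet package

class ForwardOnlyExtract
{
    static void Main()
    {
        // The source stream is only ever read forward; unread entries are
        // skipped or drained, never buffered into a MemoryStream.
        using (Stream source = File.OpenRead("archive.zip"))
        using (IReader reader = ReaderFactory.Open(source))
        {
            while (reader.MoveToNextEntry())
            {
                if (reader.Entry.IsDirectory)
                    continue;

                using (Stream entry = reader.OpenEntryStream())
                using (Stream target = File.Create(Path.GetFileName(reader.Entry.Key)))
                {
                    entry.CopyTo(target); // streamed, no random access needed
                }
            }
        }
    }
}
```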

qmfrederik commented May 31, 2017

@adamhathcock @ianhays We regularly interact with large zip files (decompressed size > 1 GB), and the current ZipArchive wasn't a good fit for us because of the memory consumption. Plus, the zip files that we generate need to comply with some additional requirements - for example, the content of some files needs to be aligned on a 4-byte boundary and the files need to be sorted in a specific order (based on their file name) - thank you, Android & iOS.

After using ZipArchive and SharpZipLib we ended up rolling our own zip library which meets our requirements, although it seems SharpCompress would probably be a fit too.

What we would really like to see is an API in .NET which exposes some of the lower-level concepts related to zip archives - for example, one where you can manually read/write the central directory, local file headers, and file contents. I guess the existing ZipArchive class could sit on top of that.
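For reference, the lower-level zip concepts in question are small fixed records; the local file header from the PKWARE APPNOTE (section 4.3.7) maps to fields like these. The struct below is an illustrative layout, not an existing .NET type - note that the variable-length extra field is where tools like Android's zipalign park padding to hit 4-byte alignment:

```csharp
// Illustrative only: the on-disk local file header of a zip entry.
// All multi-byte fields are little-endian; the file name and extra field
// immediately follow this fixed record.
internal struct ZipLocalFileHeader
{
    public const uint Signature = 0x04034B50; // "PK\x03\x04"

    public ushort VersionNeededToExtract;
    public ushort GeneralPurposeBitFlag;
    public ushort CompressionMethod;   // 0 = stored, 8 = deflate
    public ushort LastModFileTime;     // MS-DOS time format
    public ushort LastModFileDate;     // MS-DOS date format
    public uint   Crc32;
    public uint   CompressedSize;      // 0xFFFFFFFF signals Zip64
    public uint   UncompressedSize;
    public ushort FileNameLength;
    public ushort ExtraFieldLength;    // extra field can carry alignment padding
}
```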

adamhathcock commented May 31, 2017

@qmfrederik I think a SharpCompress Reader API would cover you. I know people doing your use case with SharpCompress. I'd still be interested to see what you did.

It doesn't seem they want to expose the lower-level stuff, which makes sense. If the high-level API covered you, you wouldn't need it.

ianhays commented May 31, 2017

> Maybe a new issue should be created around forward-only access.

Good idea, since we've digressed away from LZMA (this issue) a bit. Feel free to open up a new issue.

> It doesn't seem they want to expose the lower level stuff, which makes sense. If the high level API covered you, you wouldn't need it.

Exactly :) That's not to say I'm completely against exposing the specifics of the archive format, but only as a last resort if we cannot resolve issues sufficiently with the high-level API.

simonegli8 commented Sep 29, 2017

I did a DeflateStream-style implementation of LZMA for the MSPControl project. The source is here:

Streams.zip

It uses two threads, to be able to compress in the background while writing/reading to a base stream. For this it uses a PipeStream that can read and write between threads over a buffer. It has two classes: a System.IO.Compression.LzmaStream similar to DeflateStream, and a System.IO.Compression.CompressedStream where one can choose the algorithm from Raw-Uncompressed, Deflate, or LZMA.
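The two-thread arrangement described above can be approximated with System.IO.Pipelines instead of a hand-rolled PipeStream. The sketch below uses DeflateStream as a stand-in for the LZMA coder; `Pipe` and the `AsStream()` adapters come from the System.IO.Pipelines package, which postdates the original comment:

```csharp
using System.IO;
using System.IO.Compression;
using System.IO.Pipelines;
using System.Threading.Tasks;

static class BackgroundCompressor
{
    // One task compresses into the pipe while the caller concurrently
    // drains the other end into the base stream.
    public static async Task CompressAsync(Stream input, Stream baseStream)
    {
        var pipe = new Pipe();

        Task producer = Task.Run(async () =>
        {
            // Disposing the writer-backed stream completes the PipeWriter,
            // which lets the reader side observe end-of-stream.
            using (Stream sink = pipe.Writer.AsStream())
            using (var coder = new DeflateStream(sink, CompressionMode.Compress))
            {
                await input.CopyToAsync(coder);
            }
        });

        using (Stream drained = pipe.Reader.AsStream())
        {
            await drained.CopyToAsync(baseStream); // runs concurrently with producer
        }
        await producer;
    }
}
```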

oldrev commented Dec 30, 2017

Hi @simonegli8, I did a toy project to implement 7-zip-style multi-threaded encoding/decoding too!

In my opinion, the 7-zip approach of separating coders and streams is easier to use when you need to combine multiple coders, e.g. preprocessing/encryption/compression.

My program uses a custom stream connector to avoid memory allocation between streams.

MaltCompress

Cheers!
