Add support for Zstandard to System.IO.Compression #59591

Open
Tracked by #62658
carlossanlop opened this issue Sep 25, 2021 · 14 comments
Comments

@carlossanlop
Member

carlossanlop commented Sep 25, 2021

Zstandard (or Zstd) is a fast compression algorithm published by Facebook in 2015; its most recent stable release, 1.5.0, shipped in May 2021.

The official repository offers a C implementation: https://github.com/facebook/zstd

Data compression mechanism specification: https://datatracker.ietf.org/doc/html/rfc8478

Features:

  • It is faster than Deflate, especially at decompression, while offering a similar compression ratio.
  • Its maximum compression level is similar to that of lzma, and it performs better than lza and bzip2.
  • It sits on the Pareto frontier: it decompresses faster than any other currently available algorithm with a similar or better compression ratio.
  • It supports multi-threading.
  • Compressed output can be saved to a *.zst file.
  • It is dual-licensed under BSD and GPLv2; we would use the BSD license.

It is used by:

  • The Linux kernel, as a compression option for btrfs and SquashFS since 2017.
  • FreeBSD, for core dumps.
  • Amazon Redshift, for databases.
  • Canonical, Fedora, and Arch Linux, for their package managers.
  • The Nintendo Switch, to compress its files.

We could offer a stream-based class, like we do for Deflate with DeflateStream or GZipStream, but we should also consider offering a stream-less static class, since it's a common request.
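
For illustration, a stream-less API could mirror the shape of the existing BrotliEncoder/BrotliDecoder one-shot methods; the names below are hypothetical and not a formal proposal:

namespace System.IO.Compression
{
    // Hypothetical stream-less API, modeled on BrotliEncoder/BrotliDecoder.
    public static class ZStandardEncoder
    {
        // Returns false if the destination buffer is too small.
        public static bool TryCompress(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
        public static bool TryCompress(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten, int compressionLevel);
        // Worst-case compressed size, analogous to ZSTD_compressBound().
        public static int GetMaxCompressedLength(int inputSize);
    }

    public static class ZStandardDecoder
    {
        public static bool TryDecompress(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    }
}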

@carlossanlop added the api-suggestion and area-System.IO.Compression labels on Sep 25, 2021
@carlossanlop added this to To do in System.IO - Compression via automation on Sep 25, 2021
@dotnet-issue-labeler added the untriaged label on Sep 25, 2021
@manandre
Contributor

It would be a great enhancement for .NET, and it would also raise the public visibility of this impressive compression algorithm.
If you accept it, I can contribute to making it happen.
I already foresee multiple steps:

  • Design and review API proposal
    • Stream-based API
    • Static (and allocation-less?) API: Compressor and Decompressor
  • Port Facebook Zstandard C sources
  • Add Zstandard entrypoints to System.IO.Compression.Native for Windows & Unix (see the interop sketch after this list)
  • Implement approved API in a new System.IO.Compression.Zstandard library project
  • Add unit tests
    • StreamConformanceTests
    • CompressionStreamUnitTests
    • Custom Zstandard tests?
  • Add performance tests in dotnet/performance
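
As a rough sketch of the interop step, the one-shot libzstd entry points could be bound like this; the library name, wrapper type, and direct libzstd binding are illustrative only, since the real calls would go through System.IO.Compression.Native:

using System.Runtime.InteropServices;

internal static class ZstdNative
{
    private const string Library = "libzstd"; // illustrative; the shim would own the export names

    [DllImport(Library)] internal static extern int ZSTD_minCLevel();
    [DllImport(Library)] internal static extern int ZSTD_maxCLevel();
    [DllImport(Library)] internal static extern nuint ZSTD_compressBound(nuint srcSize);
    [DllImport(Library)] internal static extern uint ZSTD_isError(nuint code);

    [DllImport(Library)]
    internal static extern unsafe nuint ZSTD_compress(
        byte* dst, nuint dstCapacity, byte* src, nuint srcSize, int compressionLevel);

    [DllImport(Library)]
    internal static extern unsafe nuint ZSTD_decompress(
        byte* dst, nuint dstCapacity, byte* src, nuint compressedSize);
}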

Open questions:

  • Support for training dictionaries? (see Small Data compression)
  • Support for advanced parameters? (see zstd.h)
  • Mapping for the CompressionLevel enum? Proposal (see the sketch after this list):
    • CompressionLevel.Optimal => 0 // Default (=3 currently)
    • CompressionLevel.NoCompression => ZSTD_minCLevel() // ? Not supported, but we have to choose a value here
    • CompressionLevel.Fastest => ZSTD_minCLevel()
    • CompressionLevel.SmallestSize => ZSTD_maxCLevel()
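
A minimal sketch of that mapping (ZstdNative is an assumed interop wrapper, not an existing API):

// Illustrative mapping from CompressionLevel to a zstd level, per the proposal above.
internal static int ToZstdLevel(CompressionLevel level) => level switch
{
    CompressionLevel.Optimal => 0,                                 // 0 means "use default", currently 3
    CompressionLevel.NoCompression => ZstdNative.ZSTD_minCLevel(), // no true "store" mode; a value must be chosen
    CompressionLevel.Fastest => ZstdNative.ZSTD_minCLevel(),
    CompressionLevel.SmallestSize => ZstdNative.ZSTD_maxCLevel(),
    _ => throw new ArgumentOutOfRangeException(nameof(level)),
};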

@carlossanlop removed the untriaged label on Sep 30, 2021
@carlossanlop
Member Author

Thank you, @manandre, for your offer!

Let's start by discussing the stream API.

I think it makes sense for the stream class to look very similar to DeflateStream, since both would only wrap a compression algorithm (unlike the Zip, GZip, and ZLib APIs, which additionally represent a compression/archiving format).

I am thinking we can avoid creating too many constructors by introducing a separate ZStandardOptions class to specify the configuration values.

The ZStandardOptions class will allow specifying the compression level as an integer (and will throw if an out-of-bounds value is specified). This helps avoid the typical CompressionLevel limitation of only four values. But if the user wants to use the enum anyway, we can provide a constructor that takes a CompressionLevel and converts it to a predefined value from the level range allowed by ZStandard, which goes from 1 to 22, with 3 being the default. The user should also be able to specify negative levels, according to the manual:

The library supports regular compression levels from 1 up to ZSTD_maxCLevel(), which is currently 22. Levels >= 20, labeled --ultra, should be used with caution, as they require more memory. The library also offers negative compression levels, which extend the range of speed vs. ratio preferences. The lower the level, the faster the speed (at the cost of compression).
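
A minimal sketch of that bounds check, assuming the limits are surfaced from ZSTD_minCLevel()/ZSTD_maxCLevel() through interop (ZstdNative is an assumed wrapper, not an existing API):

// Fragment of a hypothetical ZStandardOptions implementation.
private int _compressionLevel = 3; // ZSTD_CLEVEL_DEFAULT

public int CompressionLevel
{
    get => _compressionLevel;
    set
    {
        if (value < ZstdNative.ZSTD_minCLevel() || value > ZstdNative.ZSTD_maxCLevel())
            throw new ArgumentOutOfRangeException(nameof(value));
        _compressionLevel = value;
    }
}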

Questions

  • WriteByte is a method that we decided to override in ZLibStream, but not in DeflateStream or GZipStream. Do we need to override it here?
  • Naming: Do we want to use the full name ZStandardStream|Options, or do we prefer the shorter ZstdStream|Options? I'm inclined toward the first one.
  • The official C implementation specifies 3 as the default compression level, but other libraries have chosen a different value. For example, ZStandard.Net decided to use 6. Does anyone have a reason not to use 3?
  • The namespace is System.IO.Compression. Do we wish to add the ZStandard classes here as well, or should we give them their own assembly like we did with Brotli? Say, System.IO.Compression.ZStandard? Why or why not?
  • If we allow the user to specify a negative compression level, should we ask the user to also manually specify --fast somehow, or should the class take care of that automatically?
  • The algorithm has multithreading support. Should it affect the public API surface proposed below?
  • The options constructor that takes a CompressionLevel may not play well with having a public settable int CompressionLevel property. What if the user specifies a value for both?:
var options = new ZStandardOptions(level: CompressionLevel.SmallestSize) { CompressionLevel = -5 };
  • What does the community think about adding ZStandard as an in-box feature in System.IO.Compression, as opposed to contributing to an existing external library? I don't yet see one targeting .NET Core or newer. A couple of examples:
    • bp74/Zstandard.Net, last updated 3 years ago. Targets .NET Standard 2.0 and .NET Framework 4.5.
    • skbkontur/ZstdNet, last updated 1 year ago. Targets .NET Standard 2.0 and 2.1. It's a fork of the previous one.
namespace System.IO.Compression
{
    public class ZStandardOptions
    {
        /// <summary>Allow mapping the CompressionLevel enum to predefined levels for ZStandard:
        /// - CompressionLevel.NoCompression = 1, // Official normal minimum
        /// - CompressionLevel.Fastest = 1,       // Official normal minimum
        /// - CompressionLevel.Optimal = 3,       // Official default: ZSTD_CLEVEL_DEFAULT
        /// - CompressionLevel.SmallestSize = 22  // Official maximum: ZSTD_MAX_CLEVEL
        /// </summary>
        public ZStandardOptions(CompressionLevel level);
        // Min = ZSTD_minCLevel() which can be negative, Max=ZSTD_maxCLevel()=22, Default=ZSTD_CLEVEL_DEFAULT=3, throw if out-of-bounds
        public int CompressionLevel { get; set; }
        public CompressionMode Mode { get; set; }
        public bool LeaveOpen { get; set; }
        public static int MaxCompressionLevel { get; } // P/Invoke for current maximum: 22
    }

    public class ZStandardStream : Stream
    {
        public ZStandardStream(Stream stream, ZStandardOptions? options); // If options null, then use default values
        public Stream BaseStream { get; }
        public override bool CanRead { get; }
        public override bool CanSeek { get; }
        public override bool CanWrite { get; }
        public override long Length { get; }
        public override long Position { get; set; }
        public override IAsyncResult BeginRead(byte[] buffer, int offset, int count, AsyncCallback? asyncCallback, object? asyncState);
        public override IAsyncResult BeginWrite(byte[] buffer, int offset, int count, AsyncCallback? asyncCallback, object? asyncState);
        public override void CopyTo(Stream destination, int bufferSize);
        public override Task CopyToAsync(Stream destination, int bufferSize, CancellationToken cancellationToken);
        protected override void Dispose(bool disposing);
        public override ValueTask DisposeAsync();
        public override int EndRead(IAsyncResult asyncResult);
        public override void EndWrite(IAsyncResult asyncResult);
        public override void Flush();
        public override Task FlushAsync(CancellationToken cancellationToken);
        public override int Read(byte[] buffer, int offset, int count);
        public override int Read(Span<byte> buffer);
        public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken);
        public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default(CancellationToken));
        public override int ReadByte();
        public override long Seek(long offset, SeekOrigin origin);
        public override void SetLength(long value);
        public override void Write(byte[] buffer, int offset, int count);
        public override void Write(ReadOnlySpan<byte> buffer);
        public override void WriteByte(byte value); // ZLibStream overrides it, but not Deflate/GZipStream
        public override Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken);
        public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default(CancellationToken));
    }
}
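
For context, here is what usage of the proposed types could look like (purely hypothetical, since none of these types exist yet):

using System.IO;
using System.IO.Compression;

// Hypothetical usage of the API proposed above; ZStandardStream/ZStandardOptions do not exist yet.
using FileStream input = File.OpenRead("data.bin");
using FileStream output = File.Create("data.bin.zst");

var options = new ZStandardOptions(CompressionLevel.SmallestSize) { Mode = CompressionMode.Compress };
using var zstd = new ZStandardStream(output, options);
input.CopyTo(zstd); // compressed frame is written to data.bin.zst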

@manandre
Contributor

manandre commented Oct 4, 2021

  • WriteByte: the ZLibStream and GZipStream overrides delegate to DeflateStream, which does not override it. But BrotliStream does override it, routing it directly to the overridden span-based implementation. We should do the same (see the sketch after this list).
  • I vote for ZStandardStream|Options.
  • 3 seems fine. For information, ZSTD_defaultCLevel() (returning ZSTD_CLEVEL_DEFAULT) is available since version 1.5.0.
  • Brotli classes are in the System.IO.Compression namespace but grouped in a dedicated System.IO.Compression.Brotli assembly. It seems like the best compromise to make it easily accessible without forcing it to be loaded if not explicitly referenced.
  • We should automatically handle the full range of compression levels between the (negative) minimum and maximum supported values. Why would we limit the algorithm's possibilities here?
  • We should definitely propose multithreading support. We may expose its configuration in ZStandardStream|Options as WorkerCount or MaxDegreeOfParallelism (as in ParallelOptions), with 0 as the default value (or maybe Environment.ProcessorCount?).
  • The CompressionLevel parameter in the constructor should be optional, and, when set, the native (int) compression level should always win.
  • What is the default value for CompressionMode? CompressionMode.Decompress?
  • What is the default value for LeaveOpen? false?
  • We could map CompressionLevel.Fastest to ZSTD_minCLevel(), as "The lower the level, the faster the speed".
  • We could add MinCompressionLevel and DefaultCompressionLevel as static accessors alongside the MaxCompressionLevel one.
  • What about a BufferSize configuration property (like in FileStreamOptions)? ZSTD_CStreamOutSize() and ZSTD_DStreamOutSize() could be used as the default value.
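
A minimal sketch of that WriteByte routing, as a fragment of the hypothetical ZStandardStream:

// Illustrative: forward the single-byte write to the span-based overload,
// mirroring what BrotliStream does.
public override void WriteByte(byte value)
{
    Span<byte> single = stackalloc byte[1];
    single[0] = value;
    Write(single);
}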

@jeffhandley added this to the 7.0.0 milestone on Oct 4, 2021
@agocke
Member

agocke commented Oct 20, 2021

FYI @VSadov this may be particularly interesting for single-file compression, as it is supposed to be very fast at decompression.

This might mean we would need deeper runtime integration for it to be usable during bundler loading.

@GSPP

GSPP commented Oct 21, 2021

How does the multi-threading work internally? Does it integrate somehow with the usual .NET infrastructure (TaskScheduler and such)? Or does the library start native threads?

I wonder about that because sometimes you need threading to play nice with whatever else lives in the same process. In a web app, multi-threading could cause load spikes that crowd out request work from the CPU. Reducing the DOP is only a partial fix because multiple parallel compression jobs would again saturate all cores and cause the problem to reappear. Isolating such work onto a custom thread pool can be a solution, but that would not work if the library starts its own threads.
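
For the managed side of that concern, one isolation pattern is to funnel compression jobs through a limited-concurrency scheduler; this only helps if the parallelism is driven from managed code rather than by native threads the library starts itself. A rough sketch (CompressLargePayload is a placeholder):

using System.Threading;
using System.Threading.Tasks;

// Cap concurrent compression jobs at 2 so they cannot crowd out request work.
var limited = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, 2);

Task job = Task.Factory.StartNew(
    CompressLargePayload,                 // placeholder for the actual compression work
    CancellationToken.None,
    TaskCreationOptions.DenyChildAttach,
    limited.ConcurrentScheduler);

await job;

static void CompressLargePayload() { /* ... compression work ... */ }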

Another concern would be startup overhead for multi-threading inside the library. Is there thread pooling?


It seems to me that CompressionMode should be a mandatory constructor argument. There is no sensible default and without that argument the meaning of the code is unclear.

bool LeaveOpen is about the stream, not about compression. In my opinion, it does not belong in the options class. It should be a constructor argument specific to the stream. This option would, for example, not apply to a static helper method such as static byte[] Compress(byte[] data, ZStandardOptions? options). The options object would then carry around an ignored option.
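
A sketch of the alternative shape being suggested here (hypothetical, building on the proposal above): the mode and leaveOpen move to the stream constructor, and a one-shot helper simply never sees them.

namespace System.IO.Compression
{
    public class ZStandardStream : Stream
    {
        // Mode is mandatory; leaveOpen is stream-specific, so it stays out of the options class.
        public ZStandardStream(Stream stream, CompressionMode mode,
                               ZStandardOptions? options = null, bool leaveOpen = false);
        // ...
    }

    // One-shot helpers have no use for LeaveOpen at all.
    public static class ZStandard
    {
        public static byte[] Compress(byte[] data, ZStandardOptions? options = null);
        public static byte[] Decompress(byte[] data);
    }
}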

@manandre
Contributor

About thread pooling, the zstd.h header file contains:

/* ! Thread pool :
 * These prototypes make it possible to share a thread pool among multiple compression contexts.
 * This can limit resources for applications with multiple threads where each one uses
 * a threaded compression mode (via ZSTD_c_nbWorkers parameter).
 * ZSTD_createThreadPool creates a new thread pool with a given number of threads.
 * Note that the lifetime of such pool must exist while being used.
 * ZSTD_CCtx_refThreadPool assigns a thread pool to a context (use NULL argument value
 * to use an internal thread pool).
 * ZSTD_freeThreadPool frees a thread pool, accepts NULL pointer.
 */
typedef struct POOL_ctx_s ZSTD_threadPool;
ZSTDLIB_API ZSTD_threadPool* ZSTD_createThreadPool(size_t numThreads);
ZSTDLIB_API void ZSTD_freeThreadPool (ZSTD_threadPool* pool);  /* accept NULL pointer */
ZSTDLIB_API size_t ZSTD_CCtx_refThreadPool(ZSTD_CCtx* cctx, ZSTD_threadPool* pool);
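
If worker threads are exposed, those entry points could be surfaced to managed code roughly as follows (illustrative bindings; the real implementation would go through the System.IO.Compression.Native shim):

using System;
using System.Runtime.InteropServices;

internal static class ZstdThreadPoolNative
{
    private const string Library = "libzstd"; // illustrative; the shim would own the export names

    // ZSTD_threadPool* and ZSTD_CCtx* are represented as opaque IntPtr handles.
    [DllImport(Library)] internal static extern IntPtr ZSTD_createThreadPool(nuint numThreads);
    [DllImport(Library)] internal static extern void ZSTD_freeThreadPool(IntPtr pool); // accepts NULL
    [DllImport(Library)] internal static extern nuint ZSTD_CCtx_refThreadPool(IntPtr cctx, IntPtr pool);
}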

@VSadov
Member

VSadov commented Oct 28, 2021

Zstandard would be very useful for single-file compression. We currently use ZLib/Deflate because it is available in the runtime, but we would prefer something faster, as the impact of decompression is very noticeable at startup.

We examined lz4 and Zstd as alternative choices; lz4 is faster at decompression, but Zstd would let us keep the same compression ratio as we have with Deflate.

If there is Zstd support in the runtime, single-file compression will definitely switch to it.

@GSPP

GSPP commented Oct 29, 2021

Here are some interesting benchmarks: google/brotli#553. ZStandard offers a really nice trade-off between speed and compression ratio.

[benchmark chart from google/brotli#553 omitted]

@iamcarbon

iamcarbon commented Sep 5, 2023

It looks like Chrome may also be getting support for decoding zstd-encoded content, making this relevant to web and cloud scenarios as well.

https://chromestatus.com/feature/6186023867908096

Putting in my vote of support, and hoping to see this prioritized in the .NET 9.0 planning.

UPDATE: Chrome has confirmed that they are shipping zstd support in v123.

@manandre
Contributor

I have opened dotnet/aspnetcore#50643 to support the zstd Content-Encoding in ASP.NET Core.
It is currently considered blocked by ZStandard compression support in the .NET runtime.
@carlossanlop Can we make it happen in .NET 9? I am still ready to help on this topic.

@alexandrehtrb

+1

@dev-tony-hu

Is there any plan to support it in .NET 9.0?

@YohanSciubukgian

The Chrome 123 release supports zstd.

Could you consider it for .NET 9?
