New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to make magic number optional #591

Closed
indygreg opened this Issue Mar 7, 2017 · 7 comments

Comments

Projects
None yet
3 participants
@indygreg
Contributor

indygreg commented Mar 7, 2017

The frame header's 4 byte magic number header can be a substantial size of a full frame when doing dictionary compression on small data. In the case of the Firefox Mercurial repository, there are ~3M "chunks" that need compressed. That would produce ~12MB of zstd magic numbers in what is currently ~1800 MB of zlib-compressed data. That's ~0.67%, which feels high. In some cases, the 4 byte magic number is the difference between zstd (without dictionary compression) being smaller than zlib.

I know there is a block API that can be used to bypass the zstd frame header. But that requires a bit too much wheel reinvention for me. The frame header is generally useful and aside from the magic number I think it is reasonably designed in terms of not wasting space.

I'd like to float the idea for making the magic number optional in both the compressor and decompressor so advanced consumers (who care about size) can omit it while having what is a near fully-functional frame. If this were implemented, I imagine it would be built on top of ZSTD_setCCtxParameter() and a to-be-invented ZSTD_setDDctxParameter(). I anticipate the cost of checking this flag to have a negligible impact on performance.

Are you receptive to this feature request?

@Cyan4973

This comment has been minimized.

Show comment
Hide comment
@Cyan4973

Cyan4973 Mar 7, 2017

Contributor

I'm mostly worried about the complexity.
Not on the implementation, it's simple enough, but on the presentation side.
Up to now, adding a new feature required adding a new prototype, or a new field, etc.
This is exploding complexity.

We want to keep the interface manageable.
As you have already inferred, we are preparing a new setParameter logic.
The new logic is not yet settled, let alone done. But we are working on that.

Once it's in place, new feature, such as this request, will be easier to present.
So it will be time to consider its implementation.

Contributor

Cyan4973 commented Mar 7, 2017

I'm mostly worried about the complexity.
Not on the implementation, it's simple enough, but on the presentation side.
Up to now, adding a new feature required adding a new prototype, or a new field, etc.
This is exploding complexity.

We want to keep the interface manageable.
As you have already inferred, we are preparing a new setParameter logic.
The new logic is not yet settled, let alone done. But we are working on that.

Once it's in place, new feature, such as this request, will be easier to present.
So it will be time to consider its implementation.

@indygreg

This comment has been minimized.

Show comment
Hide comment
@indygreg

indygreg Mar 8, 2017

Contributor

I'm tentatively fine waiting until a setParameter API settles before this is implemented. Whatever happens, Mercurial support for compression engines is extensible. So worst case we initially support the canonical frame with the magic number then introduce a one-off compression engine later that has it optional. That's slightly annoying, but not a deal breaker.

Contributor

indygreg commented Mar 8, 2017

I'm tentatively fine waiting until a setParameter API settles before this is implemented. Whatever happens, Mercurial support for compression engines is extensible. So worst case we initially support the canonical frame with the magic number then introduce a one-off compression engine later that has it optional. That's slightly annoying, but not a deal breaker.

@Cyan4973

This comment has been minimized.

Show comment
Hide comment
@Cyan4973

Cyan4973 Jun 23, 2017

Contributor

There's some news on this front.

With latest dev branch update, we have a new "advanced API", which is supposed to take over all other ones, to reach "stable" status one day, after sufficient testing time :
https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L863

This new API features ZSTD_CCtx_setParameter() , which is the generic mechanism by which we will introduce the "noMagic" mode requested here.

The API has a larger scope though, and answers the following 2 main issues :

  • We have an expanding set of functions in "experimental" section, as a consequence of having an exploding number of combinations possible : compress from buffer, use streaming mode, use dictionary, use digested dictionary, control advanced parameters, etc.
    The new API is supposed to provide all these capabilities using a single prototype, ZSTD_compress_generic().
  • The previous _advanced() API was based around a ZSTD_parameters structure. This is a source of concerns regarding ABI stability in the future, should we need to change / introduce new parameters. It's been replaced by an enum based policy and ZSTD_CCtx_setParameter(), which allow us to introduce new parameters and new values without breaking ABI. It feels more future-proof, though it comes with its own set of risks (namely, if enum values change between 2 versions), and that's why all enum values will have to be hard-coded before moving to "stable".

Anyway, this is still early days, and we have ample time to collect feedbacks, and modify the interface when needs be. This API will remain in "experimental" section for some releases, and join "stable" status when deemed ready.

Contributor

Cyan4973 commented Jun 23, 2017

There's some news on this front.

With latest dev branch update, we have a new "advanced API", which is supposed to take over all other ones, to reach "stable" status one day, after sufficient testing time :
https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L863

This new API features ZSTD_CCtx_setParameter() , which is the generic mechanism by which we will introduce the "noMagic" mode requested here.

The API has a larger scope though, and answers the following 2 main issues :

  • We have an expanding set of functions in "experimental" section, as a consequence of having an exploding number of combinations possible : compress from buffer, use streaming mode, use dictionary, use digested dictionary, control advanced parameters, etc.
    The new API is supposed to provide all these capabilities using a single prototype, ZSTD_compress_generic().
  • The previous _advanced() API was based around a ZSTD_parameters structure. This is a source of concerns regarding ABI stability in the future, should we need to change / introduce new parameters. It's been replaced by an enum based policy and ZSTD_CCtx_setParameter(), which allow us to introduce new parameters and new values without breaking ABI. It feels more future-proof, though it comes with its own set of risks (namely, if enum values change between 2 versions), and that's why all enum values will have to be hard-coded before moving to "stable".

Anyway, this is still early days, and we have ample time to collect feedbacks, and modify the interface when needs be. This API will remain in "experimental" section for some releases, and join "stable" status when deemed ready.

@indygreg

This comment has been minimized.

Show comment
Hide comment
@indygreg

indygreg Jun 23, 2017

Contributor

That looks like a very useful API!

Contributor

indygreg commented Jun 23, 2017

That looks like a very useful API!

@LuxInno

This comment has been minimized.

Show comment
Hide comment
@LuxInno

LuxInno Jul 20, 2017

Any estimates on when the "noMagic" parameter will be available?

LuxInno commented Jul 20, 2017

Any estimates on when the "noMagic" parameter will be available?

@Cyan4973

This comment has been minimized.

Show comment
Hide comment
@Cyan4973

Cyan4973 Jul 21, 2017

Contributor

I can try to add that in the next release

Contributor

Cyan4973 commented Jul 21, 2017

I can try to add that in the next release

Cyan4973 added a commit that referenced this issue Sep 28, 2017

@Cyan4973

This comment has been minimized.

Show comment
Hide comment
@Cyan4973

Cyan4973 Sep 28, 2017

Contributor

We finally added this requested capability in dev branch update.

It can be accessed through new advanced API :
there is a new parameter, ZSTD_p_format, and a new type ZSTD_format_e, which can be used to pass this instruction via ZSTD_CCtx_setParameter() :
https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L958

While the generation of such magic-less format is easy, it was more complex to support on the decoder side.
To this end was created a new advanced API for the decoder too, which allows passing the expected data format via ZSTD_CCtx_setFormat() : https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L1304

Without this instruction, the decoder will fail, as it expects the magic number.
This little change incurred a bit more code modifications than expected on the decoder side, but it's done now, and works fine.

Contributor

Cyan4973 commented Sep 28, 2017

We finally added this requested capability in dev branch update.

It can be accessed through new advanced API :
there is a new parameter, ZSTD_p_format, and a new type ZSTD_format_e, which can be used to pass this instruction via ZSTD_CCtx_setParameter() :
https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L958

While the generation of such magic-less format is easy, it was more complex to support on the decoder side.
To this end was created a new advanced API for the decoder too, which allows passing the expected data format via ZSTD_CCtx_setFormat() : https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L1304

Without this instruction, the decoder will fail, as it expects the magic number.
This little change incurred a bit more code modifications than expected on the decoder side, but it's done now, and works fine.

@Cyan4973 Cyan4973 closed this Sep 28, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment