Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Expose the API to fine tune the window_bits parameter of ZLIB compression #35287

Closed
yyang52 opened this issue Apr 23, 2023 · 24 comments · Fixed by #35886
Closed

Comments

@yyang52
Copy link
Contributor

yyang52 commented Apr 23, 2023

Describe the enhancement requested

ZLIB library supports history buffers of different sizes by setting windowBits parameter, while currently in Arrow it is set as a fixed value. It makes sense that setting the window size to the maximum number will give a better performance, while it does not provide too much flexibility. As we know there would be some scenarios where the memory efficiency is very limited, or the software stack is relatively old and do not have a large memory capacity. For those cases, the user may want to set the window bits to a small number to save some memory. At least it would be much more flexible to give users such an option to choose the window size when de/compressing. If that makes sense, we would like to add such a property for the user to set the ZLIB(GZIP) window_bits, and will ensure not affecting the existing function calls. Thanks!

Component(s)

C++, Parquet

@mapleFU
Copy link
Member

mapleFU commented Apr 23, 2023

Maybe you can just the style of compression_level to pass the argument in?

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 23, 2023

Maybe you can just the style of compression_level to pass the argument in?

Hi, thanks for your quick reply! Do you mean we could add another parquet writer property (maybe can call it window_bits) just like how compression_level passed in? That's also what we are trying to do now :)

@mapleFU
Copy link
Member

mapleFU commented Apr 23, 2023

By the way, currently, arrow compression arguments are just level. Maybe we can support argument like rocksdb: https://github.com/facebook/rocksdb/blob/main/include/rocksdb/advanced_options.h#L87

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 23, 2023

By the way, currently, arrow compression arguments are just level. Maybe we can support argument like rocksdb: https://github.com/facebook/rocksdb/blob/main/include/rocksdb/advanced_options.h#L87

Yes, it will give much more flexibility! Seems like rocksdb already support window_bits as an option, and many other arguments as well. Maybe I could help to draft a PR to add the window_bits option (or any other if needed) to Arrow as a kickoff?

@mapleFU
Copy link
Member

mapleFU commented Apr 23, 2023

Personally I'm +1 for this, but please not break original interface a lot, maybe just change it firstly:

std::unique_ptr<Codec> MakeGZipCodec(int compression_level, GZipFormat::type format) {
  return std::make_unique<GZipCodec>(compression_level, format);
}

And use MakeGZipCodec in src/arrow/util/compression_internal.h firstly. Later we can extend it. Do you think this is ok?

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 23, 2023

Personally I'm +1 for this, but please not break original interface a lot, maybe just change it firstly:

std::unique_ptr<Codec> MakeGZipCodec(int compression_level, GZipFormat::type format) {
  return std::make_unique<GZipCodec>(compression_level, format);
}

And use MakeGZipCodec in src/arrow/util/compression_internal.h firstly. Later we can extend it. Do you think this is ok?

Yes we would definitely not break the original interface. Change MakeGZipCodec is also what I want to do first! The only thing I need to double confirm is that since we want to pass the argument from user level to the function MakeGZipCodec, we may need to add another property similar to compression_level in parquet writer and pass it down to the compression_internal right? Then user can set the window_bits when creating the parquet writer :)

@mapleFU
Copy link
Member

mapleFU commented Apr 23, 2023

I'm ok on this.
@wjones127 @kou What's your opinion for adding a window_bits here?

@kou kou changed the title Expose the API to fine tune the window_bits parameter of ZLIB compression [C++][Parquet] Expose the API to fine tune the window_bits parameter of ZLIB compression Apr 23, 2023
@kou
Copy link
Member

kou commented Apr 23, 2023

I'm OK with make window_bits customizable.
But I don't think that calling MakeGZipCodec() from cpp/parquet/ directly is a good idea.
Currently, we use Codec::Create() instead of Make*Codec() directly in cpp/parquet/.

I don't think that adding a new parameter Codec::Create() such as Codec::Create(Compression::type codec, int compression_level, int window_bits) is a good idea.
If we choose this approach, we may need to add more parameters later.
I like that we introduce a new options class such as Codec::Options but it may be too complex.

I suggest that we discuss API on dev@arrow.apache.org too.

@george-gu-2021
Copy link

Given the variety of usage scenarios, it does be quite helpful to expose the capability and flexibility to applications to decide the codec parameter.

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 23, 2023

I'm OK with make window_bits customizable. But I don't think that calling MakeGZipCodec() from cpp/parquet/ directly is a good idea. Currently, we use Codec::Create() instead of Make*Codec() directly in cpp/parquet/.

I don't think that adding a new parameter Codec::Create() such as Codec::Create(Compression::type codec, int compression_level, int window_bits) is a good idea. If we choose this approach, we may need to add more parameters later. I like that we introduce a new options class such as Codec::Options but it may be too complex.

I suggest that we discuss API on dev@arrow.apache.org too.

Thanks for your suggestion @kou! I agree that introducing a new option class is a better design and will make the extension of the parameters much easier. And it's also true that it will be more complex to do such a refactor. Do I need to initialize a discussion on dev@arrow.apache.org cause I have not used that before, thanks!

@mapleFU
Copy link
Member

mapleFU commented Apr 23, 2023

Just subscribe dev@arrow.apache.org and send a request here, maybe you can follow others' https://lists.apache.org/list.html?dev@arrow.apache.org

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 23, 2023

Just subscribe dev@arrow.apache.org and send a request here, maybe you can follow others' https://lists.apache.org/list.html?dev@arrow.apache.org

Thanks! So I need to send an email to add me into the dev list right? Then start a discussion

@mapleFU
Copy link
Member

mapleFU commented Apr 23, 2023

Yes, otherwise your mail would be blocked :)

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 23, 2023

Yes, otherwise your mail would be blocked :)

Got it, thanks!

@wgtmac
Copy link
Member

wgtmac commented Apr 23, 2023

+1 for @kou's suggestion. It is not a good idea to add a specific parameter for configs that are only acceptable by a few codecs. As I have replied to the discussion in the mailing list, I assume the proposed Codec::Options is more like the java.util.Properties with careful design of the supported keys.

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 25, 2023

+1 for @kou's suggestion. It is not a good idea to add a specific parameter for configs that are only acceptable by a few codecs. As I have replied to the discussion in the mailing list, I assume the proposed Codec::Options is more like the java.util.Properties with careful design of the supported keys.

Great suggestion and code reference. I think we could do something similar for C++ part. But instead of passing compression_level directly, we need to pass a CodecOption class when creating the Codec.

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 25, 2023

When building parquet::WriterProperties, there could be 2 design options: one is to pass compression parameters separately (as the current implementation, with more properties to be added) and make CodecOptions internally, for example:

 std::shared_ptr<parquet::WriterProperties> props =
          builder.compression(parquet::Compression::GZIP)
              ->compression_level(9)
              ->set_gzip_format(parquet::GZipFormat::DEFLATE)
              ->gzip_window_bits(12)
              ->build();

Another is to let the user create a CodecOption and pass the option to the builder directly:

CodecOption opt;
opt.compression_level = 9;
opt.window_bits = 12; 
std::shared_ptr<parquet::WriterProperties> props =
          builder.compression(parquet::Compression::GZIP)
              ->codec_option(opt)
              ->build();

Which one do you prefer? Thanks! @kou @wgtmac @mapleFU

@mapleFU
Copy link
Member

mapleFU commented Apr 25, 2023

(2) is much better for me.

@wgtmac
Copy link
Member

wgtmac commented Apr 25, 2023

I'd vote for (2) and adopt CodecOption proposed by @pitrou in the mailing list.

@mapleFU
Copy link
Member

mapleFU commented Apr 25, 2023

@wgtmac If we want it in mailiist, maybe we need a std::unique_ptr<CodecOption>?

@wgtmac
Copy link
Member

wgtmac commented Apr 25, 2023

@wgtmac If we want it in mailiist, maybe we need a std::unique_ptr<CodecOption>?

It would be a std::shared_ptr<CodecOption> or we cannot copy WriterProperties any more.

@kou
Copy link
Member

kou commented Apr 25, 2023

The mailing list discussion: https://lists.apache.org/thread/6gzmfhpfkflfhjmy3ws4y775tfx7g2f8

@kou
Copy link
Member

kou commented Apr 25, 2023

I prefer the latter too.
But I want to use CodecOptions not CodecOption because other existing options class has s.
For example, EqualOptions, ArithmeticOptions and so on.

I seems that we can use std::shared_ptr<CodecOptions> for this. Or we may want to copy CodecOptions instead of sharing it by std::shared_ptr.

@yyang52
Copy link
Contributor Author

yyang52 commented Apr 26, 2023

Thanks for all your suggestions! So I'd go with the latter one with std::shared_ptr<CodecOptions>

pitrou added a commit that referenced this issue Jul 11, 2023
…n parameter (#35886)

### Rationale for this change

Based on #35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer.

Authored-by: Yang Yang [yang10.yang@ intel.com](mailto:yang10.yang@ intel.com)
Co-authored-by: Rambacher, Mark [mark.rambacher@ intel.com](mailto:mark.rambacher@ intel.com)

### What changes are included in this PR?

Add CodecOptions and replace `compression_level` when creating the Codec. The design is basically based on previous discussions. 
### Are these changes tested?

Yes
### Are there any user-facing changes?

Yes, when user creates the `WriterProperties`

* Closes: #35287

Lead-authored-by: yyang52 <yang10.yang@intel.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 13.0.0 milestone Jul 11, 2023
@raulcd raulcd modified the milestones: 13.0.0, 14.0.0 Jul 13, 2023
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…ression parameter (apache#35886)

### Rationale for this change

Based on apache#35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer.

Authored-by: Yang Yang [yang10.yang@ intel.com](mailto:yang10.yang@ intel.com)
Co-authored-by: Rambacher, Mark [mark.rambacher@ intel.com](mailto:mark.rambacher@ intel.com)

### What changes are included in this PR?

Add CodecOptions and replace `compression_level` when creating the Codec. The design is basically based on previous discussions. 
### Are these changes tested?

Yes
### Are there any user-facing changes?

Yes, when user creates the `WriterProperties`

* Closes: apache#35287

Lead-authored-by: yyang52 <yang10.yang@intel.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment