[C++][Parquet] Expose the API to fine tune the window_bits parameter of ZLIB compression #35287
Comments
Maybe you can just the style of |
Hi, thanks for your quick reply! Do you mean we could add another parquet writer property (maybe can call it |
By the way, Arrow's compression arguments currently consist of just the level. Maybe we could support more arguments, like RocksDB does: https://github.com/facebook/rocksdb/blob/main/include/rocksdb/advanced_options.h#L87 |
Yes, it will give much more flexibility! It seems RocksDB already supports window_bits as an option, along with many other arguments. Maybe I could help draft a PR to add the window_bits option (or any other, if needed) to Arrow as a kickoff? |
Personally I'm +1 for this, but please don't break the original interface much. Maybe just change this first:

    std::unique_ptr<Codec> MakeGZipCodec(int compression_level, GZipFormat::type format) {
      return std::make_unique<GZipCodec>(compression_level, format);
    }

And use |
Yes we would definitely not break the original interface. Change |
I'm ok on this. |
I'm OK with make I don't think that adding a new parameter I suggest that we discuss API on |
Given the variety of usage scenarios, it would be quite helpful to expose this capability and let applications decide the codec parameters. |
Thanks for your suggestion @kou! I agree that introducing a new option class is a better design and will make extending the parameters much easier. It's also true that such a refactor will be more complex. Do I need to initiate a discussion on |
Just subscribe to dev@arrow.apache.org and send your request there. You can follow others' posts as examples: https://lists.apache.org/list.html?dev@arrow.apache.org |
Thanks! So I need to send an email to be added to the dev list, right? Then start a discussion |
Yes, otherwise your mail would be blocked :) |
Got it, thanks! |
+1 for @kou's suggestion. It is not a good idea to add a specific parameter for configs that are accepted by only a few codecs. As I have replied to the discussion in the mailing list, I assume the proposed |
Great suggestion and code reference. I think we could do something similar for the C++ part. But instead of passing compression_level directly, we need to pass a CodecOption class when creating the Codec. |
When building

    std::shared_ptr<parquet::WriterProperties> props =
        builder.compression(parquet::Compression::GZIP)
            ->compression_level(9)
            ->set_gzip_format(parquet::GZipFormat::DEFLATE)
            ->gzip_window_bits(12)
            ->build();

Another is to let the user create a CodecOption and pass the option to the builder directly:
|
(2) is much better for me. |
I'd vote for (2) and adopt CodecOption proposed by @pitrou in the mailing list. |
@wgtmac If we want it in the mailing list, maybe we need a |
It would be a |
The mailing list discussion: https://lists.apache.org/thread/6gzmfhpfkflfhjmy3ws4y775tfx7g2f8 |
I prefer the latter too. It seems that we can use |
Thanks for all your suggestions! So I'd go with the latter one with |
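For readers following the thread, here is a minimal, self-contained sketch of what option (2) could look like: a base options class carrying the compression level, a GZip-specific subclass carrying the extra fields, and a factory that accepts the base type. Everything in the `sketch` namespace (type names, field names, the `-1` sentinel, the default of 15) is a hypothetical stand-in written for illustration. It does not use the Arrow headers and may not match what was eventually merged.

```cpp
#include <memory>

namespace sketch {  // hypothetical names, not the Arrow API

constexpr int kUseDefaultLevel = -1;   // assumed sentinel: let the codec pick a level
constexpr int kDefaultWindowBits = 15; // assumed default: largest zlib window

enum class GZipFormat { ZLIB, DEFLATE, GZIP };

// Options every codec understands.
struct CodecOptions {
  explicit CodecOptions(int level = kUseDefaultLevel) : compression_level(level) {}
  virtual ~CodecOptions() = default;
  int compression_level;
};

// Options only the GZip codec understands. New parameters can be added here
// without touching the signature of the factory or of other codecs.
struct GZipCodecOptions : CodecOptions {
  GZipFormat format = GZipFormat::GZIP;
  int window_bits = kDefaultWindowBits;
};

class GZipCodec {
 public:
  explicit GZipCodec(const GZipCodecOptions& options)
      : level_(options.compression_level),
        format_(options.format),
        window_bits_(options.window_bits) {}

  int compression_level() const { return level_; }
  GZipFormat format() const { return format_; }
  int window_bits() const { return window_bits_; }

 private:
  int level_;
  GZipFormat format_;
  int window_bits_;
};

// The factory takes the base type. GZip-specific fields are used only when
// the caller actually passed a GZipCodecOptions; otherwise defaults apply.
inline std::unique_ptr<GZipCodec> MakeGZipCodec(const CodecOptions& options) {
  GZipCodecOptions resolved;
  resolved.compression_level = options.compression_level;
  if (const auto* gzip = dynamic_cast<const GZipCodecOptions*>(&options)) {
    resolved = *gzip;
  }
  return std::make_unique<GZipCodec>(resolved);
}

}  // namespace sketch
```

In this sketch a caller who only cares about the level passes `sketch::CodecOptions(6)` and gets the default window, while a caller who wants a smaller window fills in a `sketch::GZipCodecOptions` and sets `window_bits = 12`. That is the flexibility argument made above for an options class over a growing list of builder setters.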
…n parameter (#35886)

### Rationale for this change

Based on #35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer.

Authored-by: Yang Yang [yang10.yang@intel.com](mailto:yang10.yang@intel.com)
Co-authored-by: Rambacher, Mark [mark.rambacher@intel.com](mailto:mark.rambacher@intel.com)

### What changes are included in this PR?

Add CodecOptions and replace `compression_level` when creating the Codec. The design is basically based on previous discussions.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes, when user creates the `WriterProperties`

* Closes: #35287

Lead-authored-by: yyang52 <yang10.yang@intel.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Describe the enhancement requested
The ZLIB library supports history buffers of different sizes via the windowBits parameter, but Arrow currently sets it to a fixed value. Setting the window size to the maximum generally gives better performance, but a fixed value offers little flexibility. In some scenarios memory is very constrained, or the software stack is relatively old and does not have a large memory capacity. In those cases, the user may want to set the window bits to a smaller number to save memory. At the least, it would be much more flexible to let users choose the window size when compressing and decompressing. If that makes sense, we would like to add a property for setting the ZLIB (GZIP) window_bits, and we will make sure existing function calls are not affected. Thanks!
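To put rough numbers on the memory argument, here is a minimal sketch based on the formulas given in the comments of zlib's `zconf.h`: the compressor needs about `(1 << (windowBits + 2)) + (1 << (memLevel + 9))` bytes, and the decompressor needs about `1 << windowBits` bytes, each plus a few kilobytes of small objects that are not counted here. The helper names are made up for illustration, and the accepted range of 9 to 15 is an assumption of this sketch. It does not call zlib or Arrow.

```cpp
#include <cstddef>
#include <stdexcept>

// Assumed valid range for this sketch.
constexpr int kMinWindowBits = 9;
constexpr int kMaxWindowBits = 15;

inline void CheckWindowBits(int window_bits) {
  if (window_bits < kMinWindowBits || window_bits > kMaxWindowBits) {
    throw std::invalid_argument("window_bits must be in [9, 15]");
  }
}

// Size of the history buffer in bytes: 2^windowBits.
// This is also the approximate decompressor footprint.
inline std::size_t WindowSizeBytes(int window_bits) {
  CheckWindowBits(window_bits);
  return std::size_t{1} << window_bits;
}

// Approximate compressor footprint in bytes, following zconf.h:
// (1 << (windowBits + 2)) + (1 << (memLevel + 9)).
// memLevel defaults to 8, zlib's own default.
inline std::size_t DeflateMemoryBytes(int window_bits, int mem_level = 8) {
  CheckWindowBits(window_bits);
  if (mem_level < 1 || mem_level > 9) {
    throw std::invalid_argument("mem_level must be in [1, 9]");
  }
  return (std::size_t{1} << (window_bits + 2)) +
         (std::size_t{1} << (mem_level + 9));
}
```

With the defaults (windowBits 15, memLevel 8), `DeflateMemoryBytes(15)` comes to 256 KiB and the window itself is 32 KiB. Dropping to windowBits 12 shrinks the window to 4 KiB and the compressor estimate to 144 KiB, which is the kind of saving the request is about.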
Component(s)
C++, Parquet