[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882
Comments
Hmm, changing the default encoding is a bit awkward in Arrow, because when the encoding is not set, it always uses PLAIN. It's lucky that is when |
I agree it is tricky. I thought the code change would be easy, but it is actually dirtier than I expected. |
Can you explain why it would be dirty? |
As @mapleFU has explained, the default encoding in the WriterProperties is |
Ah, it's a pity it's inflexible like that. But we could instead store a |
Yes, that's an option I have also considered. If you are good with |
Yes, I think that would allow for more flexibility in the future. |
I've seen a piece in
// ----------------------------------------------------------------------
// Encoding support for column writer.
// This mirrors parquet-mr default encodings for writes. See:
// https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java
// https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java

/// Trait to define default encoding for types, including whether or not the type
/// supports dictionary encoding.
trait EncodingWriteSupport {
    /// Returns true if dictionary is supported for column writer, false otherwise.
    fn has_dictionary_support(props: &WriterProperties) -> bool;
}

/// Returns encoding for a column when no other encoding is provided in writer properties.
fn fallback_encoding(kind: Type, props: &WriterProperties) -> Encoding {
    match (kind, props.writer_version()) {
        (Type::BOOLEAN, WriterVersion::PARQUET_2_0) => Encoding::RLE,
        (Type::INT32, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BINARY_PACKED,
        (Type::INT64, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BINARY_PACKED,
        (Type::BYTE_ARRAY, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BYTE_ARRAY,
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_2_0) => {
            Encoding::DELTA_BYTE_ARRAY
        }
        _ => Encoding::PLAIN,
    }
}

/// Returns true if dictionary is supported for column writer, false otherwise.
fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
    match (kind, props.writer_version()) {
        // Booleans do not support dict encoding and should use a fallback encoding.
        (Type::BOOLEAN, _) => false,
        // Dictionary encoding was not enabled in PARQUET 1.0
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_1_0) => false,
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_2_0) => true,
        _ => true,
    }
}
Should we make all of these consistent? |
That sounds good. Let me check. |
…ersion 2.x (#36955)
### Rationale for this change
RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs.
### What changes are included in this PR?
* Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN).
* If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN.
* Closes: #36882
Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
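The rule described in this commit message (an UNKNOWN sentinel that the column writer resolves from the format version and column type) can be sketched roughly as below. This is a minimal illustration, not the parquet-cpp code: the enums are local stand-ins that only borrow names from the thread, and `ResolveEncoding` is a hypothetical helper.

```cpp
// Stand-in enums for illustration only; the real definitions live in parquet-cpp.
enum class Encoding { PLAIN, RLE, UNKNOWN };
enum class Type { BOOLEAN, INT32, BYTE_ARRAY };
enum class ParquetVersion { PARQUET_1_0, PARQUET_2_6 };

// Hypothetical helper: picks the encoding a column writer would end up using.
// An explicit user choice always wins; UNKNOWN means "let the writer decide".
constexpr Encoding ResolveEncoding(Encoding requested, Type type,
                                   ParquetVersion version) {
  return requested != Encoding::UNKNOWN
             ? requested
             : (type == Type::BOOLEAN && version != ParquetVersion::PARQUET_1_0)
                   ? Encoding::RLE
                   : Encoding::PLAIN;
}

// Compile-time checks of the rule as stated in the commit message.
static_assert(ResolveEncoding(Encoding::UNKNOWN, Type::BOOLEAN,
                              ParquetVersion::PARQUET_2_6) == Encoding::RLE,
              "boolean columns default to RLE on format 2.x");
static_assert(ResolveEncoding(Encoding::UNKNOWN, Type::BOOLEAN,
                              ParquetVersion::PARQUET_1_0) == Encoding::PLAIN,
              "format 1.0 keeps PLAIN");
```

Keeping UNKNOWN distinct from PLAIN is what lets a user still request PLAIN explicitly for a boolean column, since the writer can tell "not set" apart from "set to PLAIN".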
I reopened this issue, and temporarily added "Blocker" to it for the upcoming release, until we clarify for ourselves if we actually want this change or not. From #38070 (comment), @mapleFU said:
|
My understanding is that we also enabled it by default for DataPage V1 (we have a separate "parquet v2" which is enabled by default, and we decided to write RLE booleans based on that version, and not on DataPage V1 vs V2) |
The Parquet Java implementation (namely parquet-mr) has tied v2 features to data page v2. This means that users cannot use any v2 feature if the data page version is not v2. However, I think the v2 implementation in parquet-mr is incomplete and not finalized, and the spec does not prohibit applying RLE to boolean values on data page v1. |
@wgtmac I'm a bit unsure whether we can use RLE for Boolean on Data Page V1. I think the arrow-rs implementation can decode this, but I'm not sure parquet-mr can. Would you mind confirming that? |
I have verified the code here: https://issues.apache.org/jira/browse/PARQUET-2222?focusedCommentId=17746755&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17746755 |
Ah, that's my fault. I think we can keep the same behaviour as arrow-rs, since it's able to read this kind of data. 🤔 It's also interesting that when dictionary is enabled, Boolean will also write RLE |
#38070 I've added 'RLE' here, also mentioned that |
My understanding from previous comments in this (or related) threads was that arrow-rs actually lets it depend (by default) on the datapage v1 vs v2, and so doesn't generally write RLE for v1 datapages (unless you ask for it). So that's not the same behaviour as what we did here (and I don't know if that's good or bad, just pointing it out so that we are clear) |
@wgtmac should we also check the DATA_PAGE version before releasing 14.0.0? I think RLE boolean is great, but writing it on DataPageV1 might cause some problems. As you mentioned here: #36955 (comment) . arrow-rs only support Personally I'm open to allowing an arbitrary encoder to be set if possible, since the writer is able to write pages like that. But I think we could also check the PageVersion rather than only checking the Format Version? |
I am open to any option. If data page version is preferred, we can solely depend on it. Or should we revert this commit first if there is no consensus yet to unblock the release? |
How about:
diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc
index be1881d00..88829ef5d 100644
--- a/cpp/src/parquet/column_writer.cc
+++ b/cpp/src/parquet/column_writer.cc
@@ -2336,7 +2336,8 @@ std::shared_ptr<ColumnWriter> ColumnWriter::Make(ColumnChunkMetaDataBuilder* met
   Encoding::type encoding = properties->encoding(descr->path());
   if (encoding == Encoding::UNKNOWN) {
     encoding = (descr->physical_type() == Type::BOOLEAN &&
-                properties->version() != ParquetVersion::PARQUET_1_0)
+                properties->version() != ParquetVersion::PARQUET_1_0 &&
+                properties->data_page_version() == ParquetDataPageVersion::V2)
                : |
This is one option, but still differs from arrow-rs, right? It also requires some additional work in the unit test. |
Yes, here I just think that:
(However, I also think that writing RLE Boolean is OK, and the current code also looks good to me) @pitrou Do you have some advice on this? |
I've drafted an implementation that only enables RLE when both the data page and the format version are V2: #38163 Feel free to merge or close it. |
I personally don't have a strong opinion on what default is best (I also don't know how much benefit RLE has over PLAIN for booleans in practice) We do have some other cases where we already use "version 2" features by default (in combination with DataPage V1), but that might be mostly logical types, and not yet encodings. |
@jorisvandenbossche I've written a benchmark, and reading RLE might be 10x faster than Plain (due to the low performance of the Plain implementation), but I think most users would not consider boolean performance that important. However, I think writing RLE boolean on data page v1 by default is a bit risky. We may enable this after it is made clear in the Parquet standard. (We're able to do this in arrow-rs, but it's not done by default, and parquet-mr disallows this, so I think we can just disable it by default rather than disable writing it) |
Sounds good! |
…h data page and version is V2 (#38163)
### Rationale for this change
Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( #36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we:
1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust
2. If DataPage V1 is used, don't use RLE as default Boolean encoding.
### What changes are included in this PR?
Only use RLE as BOOLEAN default encoding when both data page and version is V2.
### Are these changes tested?
Yes
### Are there any user-facing changes?
RLE encoding change for Boolean.
* Closes: #36882
Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
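For comparison, the narrower default that this follow-up settles on (RLE for booleans only when both the format version and the data page version are V2) might look like the sketch below. The enums are again local stand-ins that reuse names seen in the thread, and `ResolveDefaultEncoding` is a hypothetical helper, not a parquet-cpp function.

```cpp
// Stand-in enums for illustration only; the real definitions live in parquet-cpp.
enum class Encoding { PLAIN, RLE, UNKNOWN };
enum class Type { BOOLEAN, INT32, BYTE_ARRAY };
enum class ParquetVersion { PARQUET_1_0, PARQUET_2_6 };
enum class ParquetDataPageVersion { V1, V2 };

// Hypothetical helper: same UNKNOWN-sentinel idea, with the extra
// data-page-version condition taken from the diff proposed in the thread.
constexpr Encoding ResolveDefaultEncoding(Encoding requested, Type type,
                                          ParquetVersion version,
                                          ParquetDataPageVersion page_version) {
  return requested != Encoding::UNKNOWN
             ? requested  // an explicit request is still honoured
             : (type == Type::BOOLEAN &&
                version != ParquetVersion::PARQUET_1_0 &&
                page_version == ParquetDataPageVersion::V2)
                   ? Encoding::RLE
                   : Encoding::PLAIN;
}

// Format 2.x with data page V1 falls back to PLAIN by default.
static_assert(ResolveDefaultEncoding(Encoding::UNKNOWN, Type::BOOLEAN,
                                     ParquetVersion::PARQUET_2_6,
                                     ParquetDataPageVersion::V1) == Encoding::PLAIN,
              "data page V1 keeps PLAIN by default");
// Only the combination of format 2.x and data page V2 selects RLE.
static_assert(ResolveDefaultEncoding(Encoding::UNKNOWN, Type::BOOLEAN,
                                     ParquetVersion::PARQUET_2_6,
                                     ParquetDataPageVersion::V2) == Encoding::RLE,
              "format 2.x plus data page V2 selects RLE");
```

Only the default changes here: in this sketch an explicitly requested RLE passes through even on data page V1, which matches the stated intent to disable it by default rather than disable writing it.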
Describe the enhancement requested
As discussed in https://issues.apache.org/jira/browse/PARQUET-2222, it would be reasonable to switch the default to use RLE for boolean values if the parquet version is 2.x.
Component(s)
C++, Parquet