[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

wgtmac · 2023-07-26T07:35:27Z

Describe the enhancement requested

As discussed in https://issues.apache.org/jira/browse/PARQUET-2222, it would be reasonable to switch the default to use RLE for boolean values if the parquet version is 2.x.

Component(s)

C++, Parquet

mapleFU · 2023-07-26T07:53:01Z

Emm change default encoding is a bit weird in arrow system, because when encoding is not set, it will always use PLAIN.

It's lucky that is when Dictionary enabled and fallback to boolean, we don't need to considering it because currently we don't allow dict boolean

wgtmac · 2023-07-26T15:40:21Z

I agree it is tricky. I thought it would be easy to make the code change but actually it is more dirty than I think.

pitrou · 2023-07-26T15:44:10Z

Can you explain why it would be dirty?

wgtmac · 2023-07-26T15:53:41Z

Can you explain why it would be dirty?

As @mapleFU has explained, the default encoding in the WriterProperties is PLAIN for all columns. Adding an extra default encoding for boolean type require us to check 1) data page version, 2) column type, 3) user does not explicitly set an encoding in the WriterProperties. So the only chance to do this is in the WriterProperties::Builder::build(). We need to iterate all columns when data page version is V2 and replace encoding from PLAIN to RLE for all boolean columns which user does not provide an explicit encoding.

pitrou · 2023-07-26T16:07:40Z

Ah, it's a pity it's inflexible like that.

But we could instead store a std::optional<Encoding::type> in WriterProperties (or Encoding::UNKNOWN, but that's less pretty). Then if the encoding hasn't been set to the user, the writer can choose its concrete value depending on the column type, and whether it's a v1 or v2 data page.

wgtmac · 2023-07-27T01:43:02Z

Yes, that's an option I have also considered. If you are good with std::optional<Encoding::type> approach, I can go ahead to implement it.

pitrou · 2023-07-27T06:54:05Z

Yes, I think that would allow for more flexibility in the future.

mapleFU · 2023-08-01T04:58:04Z

I've seen a piece in arrow-rs:

// ----------------------------------------------------------------------
// Encoding support for column writer.
// This mirrors parquet-mr default encodings for writes. See:
// https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java
// https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java

/// Trait to define default encoding for types, including whether or not the type
/// supports dictionary encoding.
trait EncodingWriteSupport {
    /// Returns true if dictionary is supported for column writer, false otherwise.
    fn has_dictionary_support(props: &WriterProperties) -> bool;
}

/// Returns encoding for a column when no other encoding is provided in writer properties.
fn fallback_encoding(kind: Type, props: &WriterProperties) -> Encoding {
    match (kind, props.writer_version()) {
        (Type::BOOLEAN, WriterVersion::PARQUET_2_0) => Encoding::RLE,
        (Type::INT32, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BINARY_PACKED,
        (Type::INT64, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BINARY_PACKED,
        (Type::BYTE_ARRAY, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BYTE_ARRAY,
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_2_0) => {
            Encoding::DELTA_BYTE_ARRAY
        }
        _ => Encoding::PLAIN,
    }
}

/// Returns true if dictionary is supported for column writer, false otherwise.
fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
    match (kind, props.writer_version()) {
        // Booleans do not support dict encoding and should use a fallback encoding.
        (Type::BOOLEAN, _) => false,
        // Dictionary encoding was not enabled in PARQUET 1.0
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_1_0) => false,
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_2_0) => true,
        _ => true,
    }
}

Should we make all of these consistent?

wgtmac · 2023-08-01T05:03:08Z

That sounds good. Let me check.

…ersion 2.x (#36955) ### Rationale for this change RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs. ### What changes are included in this PR? * Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN). * If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN. * Closes: #36882 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

jorisvandenbossche · 2023-10-09T14:41:18Z

I reopened this issue, and temporarily added "Blocker" to it for the upcoming release, until we clarify for ourselves if we actually want this change or not.

From #38070 (comment), @mapleFU said:

🤔As we mentioned in https://issues.apache.org/jira/browse/PARQUET-2222 . I think export RLE is not a so good idea...

Rle is default enabled on Page V2, however, write page v2 is not recommended.

Page V2 format is great, however, most parquet implementions didn't get agreement on it. As a result, page v2 is unstable now.

RLE Boolean is default enabled on page v2, I think it's ok there, but I don't think is a good idea to default enable it.

Just as this https://blog.getdaft.io/p/working-with-the-apache-parquet-file blog saying. Parquet V2 is a ambigious naming. Although we (arrow and arrow-rs ) is using format 2.x and some properties on it. Most of implementions can still decode page v1.

So I think if user know what he/she is doing, RLE is ok to export to user, however I think here we can just hide it until PARQUET-2222 has a conclusion about this

jorisvandenbossche · 2023-10-09T14:42:31Z

Rle is default enabled on Page V2, however, write page v2 is not recommended.

My understanding is that we also enabled it by default for DataPage V1 (we have a separate "parquet v2" which is enabled by default, and we decided to write RLE booleans based on that version, and not on DataPage V1 vs V2)

wgtmac · 2023-10-09T14:57:12Z

The parquet java implementation (namely parquet-mr) has mixed v2 features with data page v2. It means that user cannot use any v2 feature if data page version is not v2. However, I think the v2 implementation in parquet-mr is incomplete and not finalized. And the specs does not prohibit applying RLE to boolean values on data page v1.

mapleFU · 2023-10-09T14:59:36Z

@wgtmac I'm a bit confused that can we use RLE for Boolean on Data Page V1. I think we can decode this with arrow-rs impl, however I'm not sure parquet-mr can decode that, would you mind confirm that?

wgtmac · 2023-10-09T15:04:42Z

@wgtmac I'm a bit confused that can we use RLE for Boolean on Data Page V1. I think we can decode this with arrow-rs impl, however I'm not sure parquet-mr can decode that, would you mind confirm that?

I have verified the code here: https://issues.apache.org/jira/browse/PARQUET-2222?focusedCommentId=17746755&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17746755

mapleFU · 2023-10-09T15:07:06Z

Ah, that's my fault, I think we can keep same as arrow-rs, since it's able to read this kind of data.

🤔It's also interesting that when dictionary is enabled, Boolean will also write RLE

mapleFU · 2023-10-09T15:12:49Z

#38070 I've added 'RLE' here, also mentioned that use_dictionary doesn't work for boolean type.

jorisvandenbossche · 2023-10-09T15:17:42Z

I think we can keep same as arrow-rs

My understanding from previous comments in this (or related) threads was that arrow-rs actually let's it depend (by default) on the datapage v1 vs v2, and so doesn't generally write RLE for v1 datapages (unless you ask for it).

So that's not the same behaviour as what we did here (and I don't know if that's good or bad, just pointing out so that we are clear)

mapleFU · 2023-10-09T15:21:35Z

@wgtmac should we also check DATA_PAGE version before when release 14.0.0? I think RLE boolean is great, but write it on DataPageV1 might raise some problem.

As you mentioned here: #36955 (comment) . arrow-rs only support WriterVersion, but it doesn't prevent from writing decimal or page index, it just control the fallback_encoding and DataPage version. It might uses RLE on DataPageV1, but it doesn't set it by default.

Personally I'm open to set the arbitary encoder if possible, since writer it's able to write page like that. But I think we can just also check the PageVersion rather than only check the Format Version?

wgtmac · 2023-10-09T15:23:33Z

I am open to any option. If data page version is preferred, we can solely depend on it. Or should we revert this commit first if there is no consensus yet to unblock the release?

mapleFU · 2023-10-09T15:36:40Z

How about:

diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc
index be1881d00..88829ef5d 100644
--- a/cpp/src/parquet/column_writer.cc
+++ b/cpp/src/parquet/column_writer.cc
@@ -2336,7 +2336,8 @@ std::shared_ptr<ColumnWriter> ColumnWriter::Make(ColumnChunkMetaDataBuilder* met
   Encoding::type encoding = properties->encoding(descr->path());
   if (encoding == Encoding::UNKNOWN) {
     encoding = (descr->physical_type() == Type::BOOLEAN &&
-                properties->version() != ParquetVersion::PARQUET_1_0)
+                properties->version() != ParquetVersion::PARQUET_1_0 &&
+                properties->data_page_version() == ParquetDataPageVersion::V2)
:

wgtmac · 2023-10-09T15:38:48Z

How about:

diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc
index be1881d00..88829ef5d 100644
--- a/cpp/src/parquet/column_writer.cc
+++ b/cpp/src/parquet/column_writer.cc
@@ -2336,7 +2336,8 @@ std::shared_ptr<ColumnWriter> ColumnWriter::Make(ColumnChunkMetaDataBuilder* met
   Encoding::type encoding = properties->encoding(descr->path());
   if (encoding == Encoding::UNKNOWN) {
     encoding = (descr->physical_type() == Type::BOOLEAN &&
-                properties->version() != ParquetVersion::PARQUET_1_0)
+                properties->version() != ParquetVersion::PARQUET_1_0 &&
+                properties->data_page_version() == ParquetDataPageVersion::V2)
:

This is one option, but still differs from arrow-rs, right? It also requires some additional work in the unit test.

mapleFU · 2023-10-09T15:44:48Z

Yes, here I just think that:

Writing RLE on Boolean is only disabled in parquet-mr, self defined encoding will not be prevented
Default encoding for most using still stay at PLAIN.

(However, I also think that write RLE Boolean is ok, the current code also looks good to me)

@pitrou Do you have some advice on this?

mapleFU · 2023-10-09T16:37:54Z

I've draft a impl for only enable RLE when only data page and version is V2: #38163

Feel free to merge or close it.

jorisvandenbossche · 2023-10-10T13:37:05Z

I personally don't have a strong opinion on what default is best (I also don't know how much benefit RLE has over PLAIN for booleans in practice)

We do have some other cases where we already use "version 2" features by default (in combination with DataPage V1), but that might be mostly logical types, and not yet encodings.

mapleFU · 2023-10-10T13:43:30Z

@jorisvandenbossche I've written a benchmark, and reading RLE might get 10x faster than Plain( due to low performance of Plain implemention), but I think most user would not take boolean performance as so important.

However I think writing RLE boolean on data page v1 by default is a bit risky. We may enable this after we make it clear in parquet standard. (We're able to doing this in arrow-rs, but it's not done by default, and parquet-mr disallow this, so I think we can just disable it by default rather than disable writing it)

jorisvandenbossche · 2023-10-10T13:54:29Z

Sounds good!

…h data page and version is V2 (#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( #36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: #36882 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

…quet version 2.x (apache#36955) ### Rationale for this change RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs. ### What changes are included in this PR? * Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN). * If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN. * Closes: apache#36882 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

…quet version 2.x (apache#36955) ### Rationale for this change RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs. ### What changes are included in this PR? * Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN). * If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN. * Closes: apache#36882 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

…en both data page and version is V2 (apache#38163) ### Rationale for this change Only use RLE as BOOLEAN default encoding when data page is V2. Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default. However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we: 1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust 2. If DataPage V1 is used, don't use RLE as default Boolean encoding. ### What changes are included in this PR? Only use RLE as BOOLEAN default encoding when both data page and version is V2. ### Are these changes tested? Yes ### Are there any user-facing changes? RLE encoding change for Boolean. * Closes: apache#36882 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

wgtmac added the Type: enhancement label Jul 26, 2023

github-actions bot added Component: Parquet Component: C++ labels Jul 26, 2023

wgtmac added a commit to wgtmac/arrow that referenced this issue Jul 31, 2023

apacheGH-36882: [C++][Parquet] Default RLE for bool values in v2 pages

2c6aeca

github-actions bot mentioned this issue Jul 31, 2023

GH-36882: [C++][Parquet] Default RLE for bool values in the parquet version 2.x #36955

Merged

github-actions bot assigned wgtmac Jul 31, 2023

wgtmac added a commit to wgtmac/arrow that referenced this issue Aug 15, 2023

apacheGH-36882: [C++][Parquet] Default RLE for bool values in v2 pages

25546ee

wgtmac changed the title ~~[C++][Parquet] Use RLE for boolean type for DATA_PAGE_V2 by default~~ [C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x Aug 31, 2023

pitrou closed this as completed in #36955 Aug 31, 2023

pitrou added this to the 14.0.0 milestone Aug 31, 2023

jorisvandenbossche mentioned this issue Oct 9, 2023

GH-37312: [Python][Docs] Update Python docstrings to reflect new parquet encoding option #38070

Merged

jorisvandenbossche reopened this Oct 9, 2023

jorisvandenbossche added the Priority: Blocker Marks a blocker for the release label Oct 9, 2023

github-actions bot mentioned this issue Oct 9, 2023

GH-36882: [C++][Parquet] Use RLE as BOOLEAN default encoding when both data page and version is V2 #38163

Merged

pitrou closed this as completed in #38163 Oct 10, 2023

amoeba added the Breaking Change Includes a breaking change to the API label Nov 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

wgtmac commented Jul 26, 2023 •

edited

Loading

mapleFU commented Jul 26, 2023

wgtmac commented Jul 26, 2023

pitrou commented Jul 26, 2023

wgtmac commented Jul 26, 2023

pitrou commented Jul 26, 2023

wgtmac commented Jul 27, 2023

pitrou commented Jul 27, 2023

mapleFU commented Aug 1, 2023

wgtmac commented Aug 1, 2023

jorisvandenbossche commented Oct 9, 2023

jorisvandenbossche commented Oct 9, 2023

wgtmac commented Oct 9, 2023

mapleFU commented Oct 9, 2023

wgtmac commented Oct 9, 2023

mapleFU commented Oct 9, 2023

mapleFU commented Oct 9, 2023

jorisvandenbossche commented Oct 9, 2023

mapleFU commented Oct 9, 2023 •

edited

Loading

wgtmac commented Oct 9, 2023

mapleFU commented Oct 9, 2023

wgtmac commented Oct 9, 2023

mapleFU commented Oct 9, 2023 •

edited

Loading

mapleFU commented Oct 9, 2023

jorisvandenbossche commented Oct 10, 2023

mapleFU commented Oct 10, 2023 •

edited

Loading

jorisvandenbossche commented Oct 10, 2023

[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

Comments

wgtmac commented Jul 26, 2023 • edited Loading

Describe the enhancement requested

Component(s)

mapleFU commented Jul 26, 2023

wgtmac commented Jul 26, 2023

pitrou commented Jul 26, 2023

wgtmac commented Jul 26, 2023

pitrou commented Jul 26, 2023

wgtmac commented Jul 27, 2023

pitrou commented Jul 27, 2023

mapleFU commented Aug 1, 2023

wgtmac commented Aug 1, 2023

jorisvandenbossche commented Oct 9, 2023

jorisvandenbossche commented Oct 9, 2023

wgtmac commented Oct 9, 2023

mapleFU commented Oct 9, 2023

wgtmac commented Oct 9, 2023

mapleFU commented Oct 9, 2023

mapleFU commented Oct 9, 2023

jorisvandenbossche commented Oct 9, 2023

mapleFU commented Oct 9, 2023 • edited Loading

wgtmac commented Oct 9, 2023

mapleFU commented Oct 9, 2023

wgtmac commented Oct 9, 2023

mapleFU commented Oct 9, 2023 • edited Loading

mapleFU commented Oct 9, 2023

jorisvandenbossche commented Oct 10, 2023

mapleFU commented Oct 10, 2023 • edited Loading

jorisvandenbossche commented Oct 10, 2023

wgtmac commented Jul 26, 2023 •

edited

Loading

mapleFU commented Oct 9, 2023 •

edited

Loading

mapleFU commented Oct 9, 2023 •

edited

Loading

mapleFU commented Oct 10, 2023 •

edited

Loading