Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

Closed
wgtmac opened this issue Jul 26, 2023 · 26 comments · Fixed by #36955 or #38163
Closed

[C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x #36882

wgtmac opened this issue Jul 26, 2023 · 26 comments · Fixed by #36955 or #38163
Assignees
Labels
Breaking Change Includes a breaking change to the API Component: C++ Component: Parquet Priority: Blocker Marks a blocker for the release Type: enhancement
Milestone

Comments

@wgtmac
Copy link
Member

wgtmac commented Jul 26, 2023

Describe the enhancement requested

As discussed in https://issues.apache.org/jira/browse/PARQUET-2222, it would be reasonable to switch the default to use RLE for boolean values if the parquet version is 2.x.

Component(s)

C++, Parquet

@mapleFU
Copy link
Member

mapleFU commented Jul 26, 2023

Emm change default encoding is a bit weird in arrow system, because when encoding is not set, it will always use PLAIN.

It's lucky that is when Dictionary enabled and fallback to boolean, we don't need to considering it because currently we don't allow dict boolean

@wgtmac
Copy link
Member Author

wgtmac commented Jul 26, 2023

I agree it is tricky. I thought it would be easy to make the code change but actually it is more dirty than I think.

@pitrou
Copy link
Member

pitrou commented Jul 26, 2023

Can you explain why it would be dirty?

@wgtmac
Copy link
Member Author

wgtmac commented Jul 26, 2023

Can you explain why it would be dirty?

As @mapleFU has explained, the default encoding in the WriterProperties is PLAIN for all columns. Adding an extra default encoding for boolean type require us to check 1) data page version, 2) column type, 3) user does not explicitly set an encoding in the WriterProperties. So the only chance to do this is in the WriterProperties::Builder::build(). We need to iterate all columns when data page version is V2 and replace encoding from PLAIN to RLE for all boolean columns which user does not provide an explicit encoding.

@pitrou
Copy link
Member

pitrou commented Jul 26, 2023

Ah, it's a pity it's inflexible like that.

But we could instead store a std::optional<Encoding::type> in WriterProperties (or Encoding::UNKNOWN, but that's less pretty). Then if the encoding hasn't been set to the user, the writer can choose its concrete value depending on the column type, and whether it's a v1 or v2 data page.

@wgtmac
Copy link
Member Author

wgtmac commented Jul 27, 2023

Yes, that's an option I have also considered. If you are good with std::optional<Encoding::type> approach, I can go ahead to implement it.

@pitrou
Copy link
Member

pitrou commented Jul 27, 2023

Yes, I think that would allow for more flexibility in the future.

@mapleFU
Copy link
Member

mapleFU commented Aug 1, 2023

I've seen a piece in arrow-rs:

// ----------------------------------------------------------------------
// Encoding support for column writer.
// This mirrors parquet-mr default encodings for writes. See:
// https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java
// https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java

/// Trait to define default encoding for types, including whether or not the type
/// supports dictionary encoding.
trait EncodingWriteSupport {
    /// Returns true if dictionary is supported for column writer, false otherwise.
    fn has_dictionary_support(props: &WriterProperties) -> bool;
}

/// Returns encoding for a column when no other encoding is provided in writer properties.
fn fallback_encoding(kind: Type, props: &WriterProperties) -> Encoding {
    match (kind, props.writer_version()) {
        (Type::BOOLEAN, WriterVersion::PARQUET_2_0) => Encoding::RLE,
        (Type::INT32, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BINARY_PACKED,
        (Type::INT64, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BINARY_PACKED,
        (Type::BYTE_ARRAY, WriterVersion::PARQUET_2_0) => Encoding::DELTA_BYTE_ARRAY,
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_2_0) => {
            Encoding::DELTA_BYTE_ARRAY
        }
        _ => Encoding::PLAIN,
    }
}

/// Returns true if dictionary is supported for column writer, false otherwise.
fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
    match (kind, props.writer_version()) {
        // Booleans do not support dict encoding and should use a fallback encoding.
        (Type::BOOLEAN, _) => false,
        // Dictionary encoding was not enabled in PARQUET 1.0
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_1_0) => false,
        (Type::FIXED_LEN_BYTE_ARRAY, WriterVersion::PARQUET_2_0) => true,
        _ => true,
    }
}

Should we make all of these consistent?

@wgtmac
Copy link
Member Author

wgtmac commented Aug 1, 2023

That sounds good. Let me check.

@wgtmac wgtmac changed the title [C++][Parquet] Use RLE for boolean type for DATA_PAGE_V2 by default [C++][Parquet] Use RLE for boolean type by default when parquet version is 2.x Aug 31, 2023
pitrou pushed a commit that referenced this issue Aug 31, 2023
…ersion 2.x (#36955)

### Rationale for this change

RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs.

### What changes are included in this PR?

* Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN).
* If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN.

* Closes: #36882

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 14.0.0 milestone Aug 31, 2023
@jorisvandenbossche jorisvandenbossche added the Priority: Blocker Marks a blocker for the release label Oct 9, 2023
@jorisvandenbossche
Copy link
Member

I reopened this issue, and temporarily added "Blocker" to it for the upcoming release, until we clarify for ourselves if we actually want this change or not.

From #38070 (comment), @mapleFU said:

🤔As we mentioned in https://issues.apache.org/jira/browse/PARQUET-2222 . I think export RLE is not a so good idea...

Rle is default enabled on Page V2, however, write page v2 is not recommended.

Page V2 format is great, however, most parquet implementions didn't get agreement on it. As a result, page v2 is unstable now.

RLE Boolean is default enabled on page v2, I think it's ok there, but I don't think is a good idea to default enable it.

Just as this https://blog.getdaft.io/p/working-with-the-apache-parquet-file blog saying. Parquet V2 is a ambigious naming. Although we (arrow and arrow-rs ) is using format 2.x and some properties on it. Most of implementions can still decode page v1.

So I think if user know what he/she is doing, RLE is ok to export to user, however I think here we can just hide it until PARQUET-2222 has a conclusion about this

@jorisvandenbossche
Copy link
Member

Rle is default enabled on Page V2, however, write page v2 is not recommended.

My understanding is that we also enabled it by default for DataPage V1 (we have a separate "parquet v2" which is enabled by default, and we decided to write RLE booleans based on that version, and not on DataPage V1 vs V2)

@wgtmac
Copy link
Member Author

wgtmac commented Oct 9, 2023

The parquet java implementation (namely parquet-mr) has mixed v2 features with data page v2. It means that user cannot use any v2 feature if data page version is not v2. However, I think the v2 implementation in parquet-mr is incomplete and not finalized. And the specs does not prohibit applying RLE to boolean values on data page v1.

@mapleFU
Copy link
Member

mapleFU commented Oct 9, 2023

@wgtmac I'm a bit confused that can we use RLE for Boolean on Data Page V1. I think we can decode this with arrow-rs impl, however I'm not sure parquet-mr can decode that, would you mind confirm that?

@wgtmac
Copy link
Member Author

wgtmac commented Oct 9, 2023

@wgtmac I'm a bit confused that can we use RLE for Boolean on Data Page V1. I think we can decode this with arrow-rs impl, however I'm not sure parquet-mr can decode that, would you mind confirm that?

I have verified the code here: https://issues.apache.org/jira/browse/PARQUET-2222?focusedCommentId=17746755&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17746755

@mapleFU
Copy link
Member

mapleFU commented Oct 9, 2023

Ah, that's my fault, I think we can keep same as arrow-rs, since it's able to read this kind of data.

🤔It's also interesting that when dictionary is enabled, Boolean will also write RLE

@mapleFU
Copy link
Member

mapleFU commented Oct 9, 2023

#38070 I've added 'RLE' here, also mentioned that use_dictionary doesn't work for boolean type.

@jorisvandenbossche
Copy link
Member

I think we can keep same as arrow-rs

My understanding from previous comments in this (or related) threads was that arrow-rs actually let's it depend (by default) on the datapage v1 vs v2, and so doesn't generally write RLE for v1 datapages (unless you ask for it).

So that's not the same behaviour as what we did here (and I don't know if that's good or bad, just pointing out so that we are clear)

@mapleFU
Copy link
Member

mapleFU commented Oct 9, 2023

@wgtmac should we also check DATA_PAGE version before when release 14.0.0? I think RLE boolean is great, but write it on DataPageV1 might raise some problem.

As you mentioned here: #36955 (comment) . arrow-rs only support WriterVersion, but it doesn't prevent from writing decimal or page index, it just control the fallback_encoding and DataPage version. It might uses RLE on DataPageV1, but it doesn't set it by default.

Personally I'm open to set the arbitary encoder if possible, since writer it's able to write page like that. But I think we can just also check the PageVersion rather than only check the Format Version?

@wgtmac
Copy link
Member Author

wgtmac commented Oct 9, 2023

I am open to any option. If data page version is preferred, we can solely depend on it. Or should we revert this commit first if there is no consensus yet to unblock the release?

@mapleFU
Copy link
Member

mapleFU commented Oct 9, 2023

How about:

diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc
index be1881d00..88829ef5d 100644
--- a/cpp/src/parquet/column_writer.cc
+++ b/cpp/src/parquet/column_writer.cc
@@ -2336,7 +2336,8 @@ std::shared_ptr<ColumnWriter> ColumnWriter::Make(ColumnChunkMetaDataBuilder* met
   Encoding::type encoding = properties->encoding(descr->path());
   if (encoding == Encoding::UNKNOWN) {
     encoding = (descr->physical_type() == Type::BOOLEAN &&
-                properties->version() != ParquetVersion::PARQUET_1_0)
+                properties->version() != ParquetVersion::PARQUET_1_0 &&
+                properties->data_page_version() == ParquetDataPageVersion::V2)
:

@wgtmac
Copy link
Member Author

wgtmac commented Oct 9, 2023

How about:

diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc
index be1881d00..88829ef5d 100644
--- a/cpp/src/parquet/column_writer.cc
+++ b/cpp/src/parquet/column_writer.cc
@@ -2336,7 +2336,8 @@ std::shared_ptr<ColumnWriter> ColumnWriter::Make(ColumnChunkMetaDataBuilder* met
   Encoding::type encoding = properties->encoding(descr->path());
   if (encoding == Encoding::UNKNOWN) {
     encoding = (descr->physical_type() == Type::BOOLEAN &&
-                properties->version() != ParquetVersion::PARQUET_1_0)
+                properties->version() != ParquetVersion::PARQUET_1_0 &&
+                properties->data_page_version() == ParquetDataPageVersion::V2)
:

This is one option, but still differs from arrow-rs, right? It also requires some additional work in the unit test.

@mapleFU
Copy link
Member

mapleFU commented Oct 9, 2023

Yes, here I just think that:

  1. Writing RLE on Boolean is only disabled in parquet-mr, self defined encoding will not be prevented
  2. Default encoding for most using still stay at PLAIN.

(However, I also think that write RLE Boolean is ok, the current code also looks good to me)

@pitrou Do you have some advice on this?

@mapleFU
Copy link
Member

mapleFU commented Oct 9, 2023

I've draft a impl for only enable RLE when only data page and version is V2: #38163

Feel free to merge or close it.

@jorisvandenbossche
Copy link
Member

I personally don't have a strong opinion on what default is best (I also don't know how much benefit RLE has over PLAIN for booleans in practice)

We do have some other cases where we already use "version 2" features by default (in combination with DataPage V1), but that might be mostly logical types, and not yet encodings.

@mapleFU
Copy link
Member

mapleFU commented Oct 10, 2023

@jorisvandenbossche I've written a benchmark, and reading RLE might get 10x faster than Plain( due to low performance of Plain implemention), but I think most user would not take boolean performance as so important.

However I think writing RLE boolean on data page v1 by default is a bit risky. We may enable this after we make it clear in parquet standard. (We're able to doing this in arrow-rs, but it's not done by default, and parquet-mr disallow this, so I think we can just disable it by default rather than disable writing it)

@jorisvandenbossche
Copy link
Member

Sounds good!

pitrou added a commit that referenced this issue Oct 10, 2023
…h data page and version is V2 (#38163)

### Rationale for this change

Only use RLE as BOOLEAN default encoding when data page is V2.

Previous patch ( #36955 ) set RLE encoding for Boolean type by default.  However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we:

1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust
2. If DataPage V1 is used, don't use RLE as default Boolean encoding.

### What changes are included in this PR?

Only use RLE as BOOLEAN default encoding when both data page and version is V2.

### Are these changes tested?

Yes

### Are there any user-facing changes?

RLE encoding change for Boolean.

* Closes: #36882

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…en both data page and version is V2 (apache#38163)

### Rationale for this change

Only use RLE as BOOLEAN default encoding when data page is V2.

Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default.  However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we:

1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust
2. If DataPage V1 is used, don't use RLE as default Boolean encoding.

### What changes are included in this PR?

Only use RLE as BOOLEAN default encoding when both data page and version is V2.

### Are these changes tested?

Yes

### Are there any user-facing changes?

RLE encoding change for Boolean.

* Closes: apache#36882

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…quet version 2.x (apache#36955)

### Rationale for this change

RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs.

### What changes are included in this PR?

* Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN).
* If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN.

* Closes: apache#36882

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…en both data page and version is V2 (apache#38163)

### Rationale for this change

Only use RLE as BOOLEAN default encoding when data page is V2.

Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default.  However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we:

1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust
2. If DataPage V1 is used, don't use RLE as default Boolean encoding.

### What changes are included in this PR?

Only use RLE as BOOLEAN default encoding when both data page and version is V2.

### Are these changes tested?

Yes

### Are there any user-facing changes?

RLE encoding change for Boolean.

* Closes: apache#36882

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@amoeba amoeba added the Breaking Change Includes a breaking change to the API label Nov 17, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…quet version 2.x (apache#36955)

### Rationale for this change

RLE is usually more efficient than PLAIN encoding for boolean columns, and it is already enabled by default in parquet-mr and arrow-rs.

### What changes are included in this PR?

* Slight breaking change in ColumnProperties to set default encoding to UNKNOWN (used to be PLAIN).
* If UNKNOWN is given, let the column writer decide the column encoding according to the selected Parquet format version and the column type.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, default encoding of boolean type has been switched to RLE when the selected Parquet format version is at least 2.0 (the current default version is 2.6). It used to always be PLAIN.

* Closes: apache#36882

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…en both data page and version is V2 (apache#38163)

### Rationale for this change

Only use RLE as BOOLEAN default encoding when data page is V2.

Previous patch ( apache#36955 ) set RLE encoding for Boolean type by default.  However, parquet-cpp might write format v2 file with page v1 by default. This might cause parquet-cpp generating RLE encoding for boolean type by default. As https://issues.apache.org/jira/browse/PARQUET-2222 says, we still need some talks about that. So, we:

1. Still allow writing RLE on DataPage V2. This keeps same as parquet rust
2. If DataPage V1 is used, don't use RLE as default Boolean encoding.

### What changes are included in this PR?

Only use RLE as BOOLEAN default encoding when both data page and version is V2.

### Are these changes tested?

Yes

### Are there any user-facing changes?

RLE encoding change for Boolean.

* Closes: apache#36882

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment