Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Support RLE Encoder for Boolean type #15107

Closed
mapleFU opened this issue Dec 29, 2022 · 9 comments · Fixed by #34526
Closed

[C++][Parquet] Support RLE Encoder for Boolean type #15107

mapleFU opened this issue Dec 29, 2022 · 9 comments · Fixed by #34526

Comments

@mapleFU
Copy link
Member

mapleFU commented Dec 29, 2022

Describe the enhancement requested

Currently, in our spec, boolean type in data page can use RLE to encode itself. And parquet-mr, parquet-go and other versions all supports it.

Thanks #14147 , we already have a Rle Boolean Decoder. So for testing, just testing it local is ok.

I'd like to submit a implemention and testing this weekend.

Component(s)

C++, Parquet

@mapleFU mapleFU changed the title Support RLE Encoder for Boolean type [C++][Parquet] Support RLE Encoder for Boolean type Dec 29, 2022
@mapleFU
Copy link
Member Author

mapleFU commented Dec 29, 2022

Hey, I found an problem when trying to implement RLE for boolean.

  • Our Rle implement is in src/arrow/util/rle_encoding.h, the RleEncoder requires pre-compute the bytes it needs.
  • Current encoding uses LevelEncoder::Init and LevelEncoder::Encode or DictEncoderImpl::WriteIndices to write all data at once

So, should I:

  • Change rle_encoding to use a grow-able buffer?
  • Buffer all input in RleBooleanEncoder, and write it all-once?
  • Using other workaround?

Any idea here? (I can write a Buffer all input version first)

@rok
Copy link
Member

rok commented Dec 29, 2022

  • Buffer all input in RleBooleanEncoder, and write it all-once?

That would mean using upper bound buffer size for boolean and slicing off the unnecessary part after encoding? Sounds like a good idea. Is this what other implementations do?

@mapleFU
Copy link
Member Author

mapleFU commented Dec 29, 2022

That would mean using upper bound buffer size for boolean and slicing off the unnecessary part after encoding? Sounds like a good idea. Is this what other implementations do?

Maybe we're not able to guess a "upper bound buffer" on Encoder, I'd like to buffer values in PlainEncoder<BooleanType>, and when flushValues, acquire the buffer and put all of them into RleEncoder.

The parquet-mr uses CapacityByteArrayOutputStream in RunLengthBitPackingHybridEncoder, which is able to grow the buffer size.

I didn't find other implementions, seems that maybe people likes PLAIN Encoding? In Rust, parquet2 is not hybrid, seems it just implement bit-packing when encoding. Arrow-rs just uses a BitWriter, which is able to resize.

@rok

@mapleFU
Copy link
Member Author

mapleFU commented Jan 4, 2023

@pitrou @sfc-gh-nthimmegowda mind take a look? All change is ok for me. Maybe I can use PlainEncoder to cache all values and write to RleEncoder onces first, and later another patch to change to RleEncoder with resizable-buffer?

@pitrou
Copy link
Member

pitrou commented Jan 4, 2023

@mapleFU Take a look at what? Did you post a PR?

@mapleFU
Copy link
Member Author

mapleFU commented Jan 4, 2023

Take a look at what? Did you post a PR?

@pitrou Sorry, I'd like to post a PR but meet a problem. I want to make clear it before start writing the code. It's mentioned at #15107 (comment) . And I wonder should I:

  1. Cache all input in a PlainEncoder, and put them into RleEncoder once
  2. or make RleEncoder use a resizable-buffer
  3. using other work-arounds

If I make something wrong, please tell me

@pitrou
Copy link
Member

pitrou commented Jan 4, 2023

@mapleFU I'm sorry, but I'm afraid you'll need to do more research yourself and decide a reasonable solution.

That said, buffering all values before encoding doesn't sound very optimal...

@pitrou
Copy link
Member

pitrou commented Jan 4, 2023

Also, see https://issues.apache.org/jira/browse/PARQUET-2222

@mapleFU
Copy link
Member Author

mapleFU commented Mar 10, 2023

@pitrou I cache all values in bitmap firstly. Can take a look at #34526

@wjones127 wjones127 added this to the 12.0.0 milestone Mar 21, 2023
wjones127 pushed a commit that referenced this issue Mar 21, 2023
…4526)

Create RLE Encoder for Boolean.

### Rationale for this change

Boolean current can only use Plain Encoder, here it support RLE.

### What changes are included in this PR?

Create encoder

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes, user can use new kind of encoding.

* Closes: #15107

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue Mar 27, 2023
…an (apache#34526)

Create RLE Encoder for Boolean.

### Rationale for this change

Boolean current can only use Plain Encoder, here it support RLE.

### What changes are included in this PR?

Create encoder

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes, user can use new kind of encoding.

* Closes: apache#15107

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants