[C++][Parquet] Boolean encoding has inconsistent implementation #36939

Closed
mapleFU opened this issue Jul 30, 2023 · 2 comments · Fixed by #36972

Comments

@mapleFU
Member

mapleFU commented Jul 30, 2023

Describe the bug, including details regarding any error messages, version, and platform.

For PlainEncoder<BooleanType>:

  1. For encoding without Arrow:
  void PutSpaced(const bool* src, int num_values, const uint8_t* valid_bits,
                 int64_t valid_bits_offset) override {
    if (valid_bits != NULLPTR) {
      PARQUET_ASSIGN_OR_THROW(auto buffer, ::arrow::AllocateBuffer(num_values * sizeof(T),
                                                                   this->memory_pool()));
      T* data = reinterpret_cast<T*>(buffer->mutable_data());
      int num_valid_values = ::arrow::util::internal::SpacedCompress<T>(
          src, num_values, valid_bits, valid_bits_offset, data);
      Put(data, num_valid_values);
    } else {
      Put(src, num_values);
    }
  }

If the values contain nulls, this only puts the valid (non-null) values, so the number of values written equals the number of set bits in valid_bits.

  2. For an Arrow Array:
  void Put(const ::arrow::Array& values) override {
    if (values.type_id() != ::arrow::Type::BOOL) {
      throw ParquetException("direct put to boolean from " + values.type()->ToString() +
                             " not supported");
    }
    ...
    sink_.UnsafeAdvance(data.length());
  }

This always advances the output by the full array length, nulls included. The two implementations are inconsistent.
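To make the difference concrete, here is a minimal standalone sketch of what the compress step in the first path does. `IsValid` and `CompressSpaced` are hypothetical names written for this illustration; they are a simplified stand-in for `::arrow::util::internal::SpacedCompress`, not the library code, and they assume an LSB-first validity bitmap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: test bit `i` of an LSB-first validity bitmap.
inline bool IsValid(const uint8_t* valid_bits, int64_t i) {
  return ((valid_bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical, simplified stand-in for the compress step in PutSpaced:
// copies only the non-null entries of `src` into `out` and returns how many
// were copied. `out` must have room for `num_values` entries.
inline int CompressSpaced(const bool* src, int num_values,
                          const uint8_t* valid_bits,
                          int64_t valid_bits_offset, bool* out) {
  assert(num_values >= 0);
  int num_valid = 0;
  for (int i = 0; i < num_values; ++i) {
    if (IsValid(valid_bits, valid_bits_offset + i)) {
      out[num_valid++] = src[i];
    }
  }
  return num_valid;
}
```

With five input values of which two are null, this returns 3, and only those three values would reach `Put`. The Array path described above instead accounts for all five slots.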

Component(s)

C++, Parquet

@mapleFU
Member Author

mapleFU commented Jul 30, 2023

cc @emkornfield @pitrou @wgtmac is this expected?

mapleFU changed the title from "[C++][Parquet] Boolean encoding has inconsistent implementation" to "[C++][Parquet] Boolean encoding has more space than expected" on Aug 1, 2023
mapleFU changed the title from "[C++][Parquet] Boolean encoding has more space than expected" back to "[C++][Parquet] Boolean encoding has inconsistent implementation" on Aug 1, 2023
@mapleFU
Member Author

mapleFU commented Aug 1, 2023

Additionally, sink_.UnsafeAdvance(data.length()); advances the sink by one byte per array element, nulls included. The encoded data, however, should occupy only one bit per non-null element.
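The size mismatch described in this comment can be sketched as simple arithmetic. The function names below are hypothetical and exist only for this illustration; the sketch assumes that PLAIN-encoded booleans are bit-packed at one bit per value.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to bit-pack `num_bits` booleans (one bit each, rounded up).
inline int64_t BytesForBits(int64_t num_bits) {
  assert(num_bits >= 0);
  return (num_bits + 7) / 8;
}

// Size the comment says the encoded data should have: one bit per
// non-null value.
inline int64_t ExpectedEncodedBytes(int64_t length, int64_t null_count) {
  assert(0 <= null_count && null_count <= length);
  return BytesForBits(length - null_count);
}

// Size the comment attributes to `sink_.UnsafeAdvance(data.length())`:
// one byte per array slot, nulls included.
inline int64_t AdvancedBytes(int64_t length) { return length; }
```

For an array of 100 slots with 20 nulls, the first function gives 10 bytes while the second gives 100, which is the gap the comment points at.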

pitrou added a commit that referenced this issue Aug 10, 2023
… called several times (#36972)

### Rationale for this change

This fixes a bug in PLAIN encoding with `BooleanArray` input: the boolean encoder produces an incorrect output length when writing Arrow data.

This interface is not widely used.

### What changes are included in this PR?

Rewrite PLAIN boolean encoder to use `TypedBufferBuilder` instead of an incorrect hand-baked implementation.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No.

* Closes: #36939

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
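
The commit message says the encoder was rewritten on top of `TypedBufferBuilder`. The following is not that patch. It is a standalone sketch of the general idea of appending only non-null booleans, one bit each, to a growable builder that keeps its own bit count across calls. `PackedBoolBuilder` and `PutNonNull` are hypothetical names and do not reflect Arrow's API; the bit order (LSB first) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical append-only bit builder, standing in for the role a typed
// buffer builder plays in the encoder. Not Arrow's API.
class PackedBoolBuilder {
 public:
  // Append one boolean as a single bit, LSB first within each byte.
  void Append(bool value) {
    if (num_bits_ % 8 == 0) bytes_.push_back(0);
    if (value) bytes_.back() |= static_cast<uint8_t>(1u << (num_bits_ % 8));
    ++num_bits_;
  }

  int64_t num_bits() const { return num_bits_; }
  const std::vector<uint8_t>& bytes() const { return bytes_; }

 private:
  std::vector<uint8_t> bytes_;
  int64_t num_bits_ = 0;
};

// Hypothetical helper: append the non-null entries of `values` to `builder`.
// A null `valid_bits` pointer means every entry is valid. Because the
// builder tracks its own length, this can be called several times in a row.
inline void PutNonNull(const bool* values, const uint8_t* valid_bits,
                       int64_t length, PackedBoolBuilder* builder) {
  assert(length >= 0);
  for (int64_t i = 0; i < length; ++i) {
    const bool valid =
        valid_bits == nullptr || ((valid_bits[i / 8] >> (i % 8)) & 1) != 0;
    if (valid) builder->Append(values[i]);
  }
}
```

Calling `PutNonNull` twice on the same builder, first with three values of which one is null and then with two valid values, leaves four bits packed into a single byte. The output length depends only on the non-null count, regardless of how the input was split across calls.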
@pitrou pitrou added this to the 14.0.0 milestone Aug 10, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…t when called several times (apache#36972)
