Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Getting only 0 when reading DELTA_BINARY_PACKED #15052

Closed
Anieway opened this issue Dec 20, 2022 · 5 comments · Fixed by #15124
Closed

[C++][Parquet] Getting only 0 when reading DELTA_BINARY_PACKED #15052

Anieway opened this issue Dec 20, 2022 · 5 comments · Fixed by #15124
Assignees
Labels
Component: C++ Component: Parquet Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@Anieway
Copy link

Anieway commented Dec 20, 2022

Describe the bug, including details regarding any error messages, version, and platform.

I have encountered a possible bug where when reading values one at a time with the low level API of parquet reader
(as done similarly in the example: /cpp/examples/parquet/low_level_api/reader_writer.cc)
in a row with DELTA_BINARY_PACKED encoding the results are all 0 regardless of file content.

aka this rows_read = int64_reader->ReadBatch(1, &definition_level, &repetition_level, &value, &values_read); gives a zero value
but this rows_read = int64_reader->ReadBatch(2, nullptr, nullptr, values, &values_read); gives the correct value.

So this seems not to occur when reading bigger batch-sizes (>1).

The problem might be somewhere within DeltaBitPackDecoder<DType>::GetInternal

Component(s)

Parquet

@rok
Copy link
Member

rok commented Dec 20, 2022

Interesting! Are you building from source?

@mapleFU
Copy link
Member

mapleFU commented Dec 29, 2022

Hi, Anieway, I submit a fixing for encoder before: #14959

And I cannot reproduce this bug now, my code is running on master, and my running method is:

index 09af32289..19a96dc40 100644
--- a/cpp/examples/parquet/low_level_api/reader_writer.cc
+++ b/cpp/examples/parquet/low_level_api/reader_writer.cc
@@ -64,6 +64,8 @@ int main(int argc, char** argv) {
     // Add writer properties
     parquet::WriterProperties::Builder builder;
     builder.compression(parquet::Compression::SNAPPY);
+    builder.disable_dictionary();
+    builder.encoding("int32_field", parquet::Encoding::DELTA_BINARY_PACKED);
+    builder.encoding("int64_field", parquet::Encoding::DELTA_BINARY_PACKED);
     std::shared_ptr<parquet::WriterProperties> props = builder.build();

Can you pull the lastest code to check if the problem still here? If still a bug, can you provide the minimal code to reproduce this bug?

@tachyonwill
Copy link
Contributor

I was able to reproduce. I believe the problem is

if (ARROW_PREDICT_FALSE(i == max_values)) break;

when requesting a single value at a time, GetInternal will never actually call InitBlock. Thus the only thing that will ever be written to the output is the first value from the header. The line should probably be
if (ARROW_PREDICT_FALSE(i == total_value_count_)) break;

@mapleFU
Copy link
Member

mapleFU commented Dec 30, 2022

Seems that my previous trying is failed because I use RelWithDebInfo, and it skip assert. Thanks @tachyonwill
I submit a naive fix here: #15124

@mapleFU
Copy link
Member

mapleFU commented Dec 30, 2022

By the way, can anyone add parquet and C++ tag for this issue? I have no permission on this project and this issue :) @rok @Anieway

pitrou added a commit that referenced this issue Jan 4, 2023
… only one value (#15124)

This patch trying to fix #15052 . The problem is mentioned here: #15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in #10627 and amol-@d982bed

I will add some test tonight
* Closes: #15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 11.0.0 milestone Jan 4, 2023
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this issue Jan 5, 2023
…eading only one value (apache#15124)

This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in apache#10627 and amol-@d982bed

I will add some test tonight
* Closes: apache#15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this issue Jan 5, 2023
…eading only one value (apache#15124)

This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in apache#10627 and amol-@d982bed

I will add some test tonight
* Closes: apache#15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue Jan 9, 2023
…eading only one value (apache#15124)

This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in apache#10627 and amol-@d982bed

I will add some test tonight
* Closes: apache#15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@wjones127 wjones127 added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Component: Parquet Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants