[C++][Parquet] Regression reading byte-stream-split encoded floats with null values in Arrow 16.0.0 #41562

Closed
adamreeve opened this issue May 6, 2024 · 4 comments

@adamreeve
Contributor

adamreeve commented May 6, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Write byte-stream-split encoded floats containing null values:

import pyarrow as pa
import pyarrow.parquet as pq

num_rows = 10
xs = pa.array(
        [None if i % 10 == 5 else (i / 3.14) for i in range(num_rows)],
        type=pa.float32())

table = pa.Table.from_arrays([xs], names=['x'])
pq.write_table(
        table, 'data.parquet',
        use_byte_stream_split=True,
        use_dictionary=False)

And then attempt to read the data back:

import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table('data.parquet')
xs = table['x']

num_rows = 10
assert len(xs) == num_rows
for i in range(num_rows):
    value = xs[i]
    if i % 10 == 5:
        assert not value.is_valid
    else:
        assert value.is_valid
        assert value.equals(pa.scalar(i / 3.14, type=pa.float32()))

The above code works with pyarrow 15.0.2 but fails with pyarrow 16.0.0 with the following exception:

Traceback (most recent call last):
  File "/home/adam/dev/parquet-issues/null-byte-stream-split-regression/read_data.py", line 3, in <module>
    table = pq.read_table('data.parquet')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1811, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/dev/virtualenvs/ml/lib64/python3.12/site-packages/pyarrow/parquet/core.py", line 1454, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Data size (36) does not match number of values in BYTE_STREAM_SPLIT (10)

Writing the data with pyarrow 15.0.2 and reading with pyarrow 16.0.0 also fails, but writing with 16.0.0 and reading with 15.0.2 works fine, which points to the reader rather than the writer. Disabling byte stream split encoding or not writing any nulls also makes the error go away.
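
The numbers in the error message can be reproduced without pyarrow. Parquet stores nulls in the definition levels rather than in the value payload, so a page with 10 rows and one null carries only 9 encoded float32 values. The following is a minimal pure-Python sketch of the BYTE_STREAM_SPLIT layout (stream k holds byte k of every value), not Arrow's implementation; the function names `byte_stream_split_encode` and `byte_stream_split_decode` are made up for illustration.

```python
import struct


def byte_stream_split_encode(values):
    """Encode float32 values: stream k holds byte k of every value."""
    raw = [struct.pack("<f", v) for v in values]
    return b"".join(bytes(r[k] for r in raw) for k in range(4))


def byte_stream_split_decode(data, width=4):
    """Inverse of the encoder: take one byte from each stream per value."""
    if len(data) % width != 0:
        raise ValueError("payload size is not a multiple of the value width")
    n = len(data) // width
    return [
        struct.unpack("<f", bytes(data[k * n + i] for k in range(width)))[0]
        for i in range(n)
    ]


num_rows = 10
column = [None if i % 10 == 5 else i / 3.14 for i in range(num_rows)]

# Only non-null values are written to the encoded payload.
non_null = [v for v in column if v is not None]
payload = byte_stream_split_encode(non_null)

print(len(payload), num_rows * 4)  # 36 40
```

The payload is 36 bytes (9 values of 4 bytes each), while a size computed from the row count including the null would be 40 bytes, which is consistent with the "Data size (36) ... (10)" message above.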

This looks related to #28737 although the error there was quite different.

Component(s)

C++, Parquet

@jorisvandenbossche added the Priority: Blocker, Critical Fix, and backport-candidate labels, and removed the Critical Fix label, on May 7, 2024
@jorisvandenbossche
Member

@adamreeve thanks a lot for the report. Can confirm this locally with latest main as well (on Linux / Ubuntu).

cc @mapleFU @pitrou

@mapleFU
Member

mapleFU commented May 7, 2024

I see the cause here and will fix it soon.

@mapleFU
Member

mapleFU commented May 7, 2024

I've submitted a basic bugfix. I'm a bit busy during work hours, so I'll add a test after 9pm (UTC-8) when I'm back home.

pitrou pushed a commit that referenced this issue May 7, 2024
…amSplitDecoder (#41565)

### Rationale for this change

This problem was introduced by #40094. The original bug was fixed in #34140, but that fix was broken by #40094.

### What changes are included in this PR?

Refine checking
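
As a rough illustration of what a refined check could look like (an assumption based only on the error message in the issue, not a copy of the change in #41565), the sketch below contrasts a strict size check with a relaxed one that accepts fewer encoded values than rows. The names `strict_check` and `relaxed_check` are hypothetical.

```python
def strict_check(data_size, num_values, byte_width=4):
    """Assumed failing behaviour: require exactly num_values encoded values."""
    if num_values * byte_width != data_size:
        raise OSError(
            f"Data size ({data_size}) does not match number of values "
            f"in BYTE_STREAM_SPLIT ({num_values})"
        )
    return num_values


def relaxed_check(data_size, num_values, byte_width=4):
    """Hypothetical refinement: allow fewer encoded values, since nulls are not stored."""
    if data_size % byte_width != 0:
        raise OSError(f"Data size ({data_size}) is not a multiple of {byte_width}")
    if data_size > num_values * byte_width:
        raise OSError(f"Data size ({data_size}) is too large for {num_values} values")
    # Number of values actually present in the encoded payload.
    return data_size // byte_width


print(relaxed_check(36, 10))  # 9
```

With the sizes from the report (36 bytes, 10 rows), `strict_check` raises the error seen in the traceback, while `relaxed_check` reports 9 encoded values.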

### Are these changes tested?

* [x] Will add

### Are there any user-facing changes?

Bugfix

* GitHub Issue: #41562

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou
Member

pitrou commented May 7, 2024

Issue resolved by pull request 41565
#41565

@pitrou pitrou added this to the 17.0.0 milestone May 7, 2024
@pitrou pitrou closed this as completed May 7, 2024
@jorisvandenbossche jorisvandenbossche modified the milestones: 17.0.0, 16.1.0 May 7, 2024
raulcd pushed a commit that referenced this issue May 8, 2024
…amSplitDecoder (#41565)
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…teStreamSplitDecoder (apache#41565)