[C++][Parquet] "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0 #35423

Closed
adamreeve opened this issue May 4, 2023 · 4 comments · Fixed by #35428
Labels: Component: Parquet, Critical Fix, Priority: Critical, Type: bug


Describe the bug, including details regarding any error messages, version, and platform.

Arrow 12.0.0 has a regression where it can crash when reading byte-stream-split encoded data written by itself or by older versions of Arrow:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
table = pa.Table.from_arrays([x], names=['x'])
pq.write_table(table, 'data.parquet', use_dictionary=False, use_byte_stream_split=True)

table = pq.read_table('data.parquet')
print(table)

This crashes with:

Traceback (most recent call last):
  File "/home/.../write_read_data.py", line 9, in <module>
	table = pq.read_table('data.parquet')
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
	return dataset.read(columns=columns, use_threads=use_threads,
		   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
	table = self._dataset.to_table(
			^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Data size too large for number of values (padding in byte stream split data page?)

But the above code works fine with pyarrow 11.0.0 and 10.0.1.

It appears that #34140 caused this regression. I tested building pyarrow on the current main branch (commit 42d42b1) and could reproduce the error, but it was fixed after I reverted the merge of that PR (commit c31fb46).

Component(s)

Parquet

adamreeve (Contributor) commented May 4, 2023

If I run this in a debugger, I can see ByteStreamSplitDecoder<DType>::SetData is called 4 times. For the first 3 times, num_values is 262,144 and len is 1,048,576, then on the 4th time len is again 1,048,576 but num_values is only 213,568 (for a total of 1,000,000 values).

If I only write 262,144 values, or a multiple of that, then everything works. But writing 262,145 values causes the crash.
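
For reference, this observation can be restated as a small variation of the repro above (a hypothetical roundtrip helper using the same write options):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def roundtrip(n):
    # Write n float32 values with byte-stream-split encoding and read them back.
    x = pa.array(np.linspace(0.0, 1.0, n), type=pa.float32())
    table = pa.Table.from_arrays([x], names=['x'])
    pq.write_table(table, 'data.parquet', use_dictionary=False,
                   use_byte_stream_split=True)
    return pq.read_table('data.parquet')

roundtrip(262_144)  # works: exactly 262,144 values
roundtrip(524_288)  # works: a multiple of 262,144
roundtrip(262_145)  # raises OSError: Data size too large for number of values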

From reading some of the context in #15173, it looks like the assumption that the encoders don't add any padding was incorrect?

mapleFU (Member) commented May 4, 2023

I'll try to reproduce this problem. In ByteStreamSplitDecoder::SetData, the decoder expects num_values to cover all of the values in the byte-stream-split data, so it requires num_values * static_cast<int64_t>(sizeof(T)) >= len. The two sides would be equal if there were no null values in the page.
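
A Python sketch of that check with the numbers reported above (the real check lives in the C++ ByteStreamSplitDecoder; this is only illustrative):

SIZEOF_T = 4  # float32

def set_data_check(num_values, len_bytes):
    # Mirrors the expectation num_values * sizeof(T) >= len; the two sides are
    # equal when the page has no nulls and no stale padding.
    if num_values * SIZEOF_T < len_bytes:
        raise OSError("Data size too large for number of values "
                      "(padding in byte stream split data page?)")

set_data_check(262_144, 1_048_576)  # 262,144 * 4 == 1,048,576 -> OK
set_data_check(213_568, 1_048_576)  # 213,568 * 4 == 854,272 < 1,048,576 -> raises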

I've reproduced this issue using PyArrow 12.0. I'll find out why.

mapleFU (Member) commented May 4, 2023

It seems that using ParquetTableReader to read the file directly is OK, and sizes of 100_000 or 200_000 are OK, but 300_000 fails. I guess the problem is ParquetFileFormat combined with a huge page size.
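
For context, the two read paths being compared can be exercised from Python roughly like this (assuming the data.parquet from the repro above; pq.ParquetFile wraps the plain file reader, while pq.read_table goes through the dataset scanner shown in the traceback):

import pyarrow.parquet as pq

# Plain Parquet file reader path (reported to work here)
table = pq.ParquetFile('data.parquet').read()

# Dataset-based path used by pq.read_table (the one raising the error)
table = pq.read_table('data.parquet')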

mapleFU (Member) commented May 4, 2023

I've found the reason. I don't think it's a bug in ByteStreamSplit; it's caused by page.size(). The page reader reuses one buffer across ReadNextPage calls, so its size might be greater than the expected size (page.uncompressed_size()). ByteStreamSplit just happens to expose the problem because it checks the boundary.
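
A toy model of the behaviour described here (pure illustration, not the actual SerializedPageReader code): the reused buffer keeps the size of the largest page seen so far, so a later, smaller page is handed to the decoder with a stale length.

class ReusedBuffer:
    """Toy stand-in for the reused decompression buffer."""
    def __init__(self):
        self.data = bytearray()

    def ensure_capacity(self, n):
        # Grows when needed but never shrinks -- the crux of the issue.
        if len(self.data) < n:
            self.data.extend(b'\0' * (n - len(self.data)))

buf = ReusedBuffer()
buf.ensure_capacity(1_048_576)   # earlier pages: 1,048,576 bytes each
buf.ensure_capacity(854_272)     # last page: only 854,272 bytes
print(len(buf.data))             # still 1,048,576 -> larger than
                                 # page.uncompressed_size() for the last page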

raulcd changed the title from "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0 to [C++][Parquet] "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0 on May 4, 2023
wjones127 added the Critical Fix label on May 9, 2023
pitrou modified the milestones: 12.0.1, 13.0.0 on May 15, 2023
pitrou pushed a commit that referenced this issue May 15, 2023
…r resize smaller (#35428)

### Rationale for this change

See the issue.

1. SerializedPageReader reuses the same decompression buffer, **and did not resize it when the next page is smaller than the previous one**, so `buffer->size()` might not equal `page.uncompressed_size()` (see the toy sketch after this list).
2. As a result, `data_size` in the decoder could be equal to the previous page's size.
3. When the data reaches BYTE_STREAM_SPLIT, the decoder checks that the size is not too large and throws an exception. Other decoders show no visible bug, because the uninitialized memory is never accessed.
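
A standalone toy sketch of the fix under that reading (a hypothetical resize_to_page helper mirroring the toy model earlier in this issue, not the actual C++ change): the buffer is resized down as well as up, so the decoder always sees the current page's uncompressed size.

def resize_to_page(buf: bytearray, uncompressed_size: int) -> None:
    # Keep the buffer exactly as large as the current page,
    # shrinking as well as growing.
    if len(buf) < uncompressed_size:
        buf.extend(b'\0' * (uncompressed_size - len(buf)))
    elif len(buf) > uncompressed_size:
        del buf[uncompressed_size:]

buf = bytearray(1_048_576)       # sized for the earlier, larger pages
resize_to_page(buf, 854_272)     # the last page is smaller
assert len(buf) == 854_272       # the decoder now sees the true page size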

### What changes are included in this PR?

Change the reader and add a test.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

**This PR contains a "Critical Fix".**
* Closes: #35423

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
… buffer resize smaller (apache#35428)

jorisvandenbossche modified the milestones: 13.0.0, 12.0.1 on May 16, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
… buffer resize smaller (apache#35428)

raulcd pushed a commit that referenced this issue May 30, 2023
…r resize smaller (#35428)

jarohen pushed a commit to xtdb/arrow that referenced this issue Nov 3, 2023
… buffer resize smaller (apache#35428)

jarohen pushed a commit to xtdb/arrow that referenced this issue Mar 7, 2024
… buffer resize smaller (apache#35428)
