[C++][Parquet] "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0 #35423

Closed
adamreeve opened this issue May 4, 2023 · 4 comments · Fixed by #35428
Labels: Component: Parquet, Critical Fix, Priority: Critical, Type: bug


Describe the bug, including details regarding any error messages, version, and platform.

Arrow 12.0.0 has a regression where it can crash when reading byte-stream-split encoded data written by itself or by older versions of Arrow:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
table = pa.Table.from_arrays([x], names=['x'])
pq.write_table(table, 'data.parquet', use_dictionary=False, use_byte_stream_split=True)

table = pq.read_table('data.parquet')
print(table)

This crashes with:

Traceback (most recent call last):
  File "/home/.../write_read_data.py", line 9, in <module>
	table = pq.read_table('data.parquet')
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
	return dataset.read(columns=columns, use_threads=use_threads,
		   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
	table = self._dataset.to_table(
			^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Data size too large for number of values (padding in byte stream split data page?)

But the above code works fine with pyarrow 11.0.0 and 10.0.1.

It appears that #34140 caused this regression. I tested building pyarrow on the current main branch (commit 42d42b1) and could reproduce the error, but it was fixed after I reverted the merge of that PR (commit c31fb46).

Component(s)

Parquet

adamreeve (Contributor) commented May 4, 2023

If I run this in a debugger, I can see ByteStreamSplitDecoder<DType>::SetData is called 4 times. For the first 3 times, num_values is 262,144 and len is 1,048,576, then on the 4th time len is again 1,048,576 but num_values is only 213,568 (for a total of 1,000,000 values).

If I only write 262,144 values, or a multiple of that, then everything works. But writing 262,145 values causes the crash.
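
For reference, this observation can be restated as a small variation of the repro above (a hypothetical roundtrip helper using the same write options):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def roundtrip(n):
    # Write n float32 values with byte-stream-split encoding and read them back.
    x = pa.array(np.linspace(0.0, 1.0, n), type=pa.float32())
    table = pa.Table.from_arrays([x], names=['x'])
    pq.write_table(table, 'data.parquet', use_dictionary=False,
                   use_byte_stream_split=True)
    return pq.read_table('data.parquet')

roundtrip(262_144)  # works: exactly 262,144 values
roundtrip(524_288)  # works: a multiple of 262,144
roundtrip(262_145)  # raises OSError: Data size too large for number of values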

From reading some of the context in #15173, it looks like the assumption that the encoders don't add any padding was incorrect?

mapleFU (Member) commented May 4, 2023

I'll try to reproduce this problem. In ByteStreamSplitDecoder::SetData, the decoder expects num_values to cover all of the values in the byte-stream-split data, so it requires num_values * static_cast<int64_t>(sizeof(T)) >= len. The two sides would be equal if there were no null values in the page.
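
A Python sketch of that check with the numbers reported above (the real check lives in the C++ ByteStreamSplitDecoder; this is only illustrative):

SIZEOF_T = 4  # float32

def set_data_check(num_values, len_bytes):
    # Mirrors the expectation num_values * sizeof(T) >= len; the two sides are
    # equal when the page has no nulls and no stale padding.
    if num_values * SIZEOF_T < len_bytes:
        raise OSError("Data size too large for number of values "
                      "(padding in byte stream split data page?)")

set_data_check(262_144, 1_048_576)  # 262,144 * 4 == 1,048,576 -> OK
set_data_check(213_568, 1_048_576)  # 213,568 * 4 == 854,272 < 1,048,576 -> raises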

I've reproduced this issue using PyArrow 12.0. I'll find out why.

mapleFU (Member) commented May 4, 2023

It seems that using ParquetTableReader to read the file directly is OK, and sizes of 100_000 or 200_000 are OK, but 300_000 fails. I guess the problem is ParquetFileFormat combined with a huge page size.
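
For context, the two read paths being compared can be exercised from Python roughly like this (assuming the data.parquet from the repro above; pq.ParquetFile wraps the plain file reader, while pq.read_table goes through the dataset scanner shown in the traceback):

import pyarrow.parquet as pq

# Plain Parquet file reader path (reported to work here)
table = pq.ParquetFile('data.parquet').read()

# Dataset-based path used by pq.read_table (the one raising the error)
table = pq.read_table('data.parquet')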

mapleFU (Member) commented May 4, 2023

I've found the reason. I don't think it's a bug in ByteStreamSplit; it's caused by page.size(). The page reader reuses one buffer across ReadNextPage calls, so its size might be greater than the expected size (page.uncompressed_size()). ByteStreamSplit just happens to expose the problem because it checks the boundary.
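
A toy model of the behaviour described here (pure illustration, not the actual SerializedPageReader code): the reused buffer keeps the size of the largest page seen so far, so a later, smaller page is handed to the decoder with a stale length.

class ReusedBuffer:
    """Toy stand-in for the reused decompression buffer."""
    def __init__(self):
        self.data = bytearray()

    def ensure_capacity(self, n):
        # Grows when needed but never shrinks -- the crux of the issue.
        if len(self.data) < n:
            self.data.extend(b'\0' * (n - len(self.data)))

buf = ReusedBuffer()
buf.ensure_capacity(1_048_576)   # earlier pages: 1,048,576 bytes each
buf.ensure_capacity(854_272)     # last page: only 854,272 bytes
print(len(buf.data))             # still 1,048,576 -> larger than
                                 # page.uncompressed_size() for the last page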

raulcd changed the title from "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0 to [C++][Parquet] "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0 on May 4, 2023
wjones127 added the Critical Fix label on May 9, 2023
pitrou modified the milestones: 12.0.1, 13.0.0 on May 15, 2023
pitrou pushed a commit that referenced this issue May 15, 2023
…r resize smaller (#35428)

### Rationale for this change

See the issue.

1. SerializedPageReader reuses the same decompression buffer, **and did not resize it when the next page is smaller than the previous one**, so `buffer->size()` might not equal `page.uncompressed_size()` (see the toy sketch after this list).
2. As a result, `data_size` in the decoder could be equal to the previous page's size.
3. When the data reaches BYTE_STREAM_SPLIT, the decoder checks that the size is not too large and throws an exception. Other decoders show no visible bug, because the uninitialized memory is never accessed.
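
A standalone toy sketch of the fix under that reading (a hypothetical resize_to_page helper mirroring the toy model earlier in this issue, not the actual C++ change): the buffer is resized down as well as up, so the decoder always sees the current page's uncompressed size.

def resize_to_page(buf: bytearray, uncompressed_size: int) -> None:
    # Keep the buffer exactly as large as the current page,
    # shrinking as well as growing.
    if len(buf) < uncompressed_size:
        buf.extend(b'\0' * (uncompressed_size - len(buf)))
    elif len(buf) > uncompressed_size:
        del buf[uncompressed_size:]

buf = bytearray(1_048_576)       # sized for the earlier, larger pages
resize_to_page(buf, 854_272)     # the last page is smaller
assert len(buf) == 854_272       # the decoder now sees the true page size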

### What changes are included in this PR?

Change the reader and add a test.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

**This PR contains a "Critical Fix".**
* Closes: #35423

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
… buffer resize smaller (apache#35428)

jorisvandenbossche modified the milestones: 13.0.0, 12.0.1 on May 16, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
… buffer resize smaller (apache#35428)

raulcd pushed a commit that referenced this issue May 30, 2023
…r resize smaller (#35428)

jarohen pushed a commit to xtdb/arrow that referenced this issue Nov 3, 2023
… buffer resize smaller (apache#35428)

jarohen pushed a commit to xtdb/arrow that referenced this issue Mar 7, 2024
… buffer resize smaller (apache#35428)
