# [C++][Parquet] "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0 #35423
## Comments
If I run this in a debugger, I can see that writing 262,144 values, or a multiple of that, works fine, but writing 262,145 values causes the crash. From reading some of the context in #15173, it looks like the assumption that the encoders don't add any padding was incorrect?
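That boundary has a plausible arithmetic explanation; a quick sketch, assuming parquet-cpp's default data page size of 1 MiB (an assumption about writer defaults, not stated in this thread):

```python
# Assuming the writer's default data page size is 1 MiB (an assumption
# about parquet-cpp defaults, not confirmed in this thread):
values_per_page = (1024 * 1024) // 4   # float32 -> 262,144 values per page
assert values_per_page == 262_144
# A multiple of 262,144 fills every page exactly; any other count leaves
# a final page smaller than the first, the shape that triggers the crash.
```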
I'll try to reproduce this problem. I've reproduced this issue using PyArrow 12.0. I'll find out why.
It seems that using ParquetTableReader to just read that file is OK, and a size of 100_000 or 200_000 is OK, but 300_000 fails. I guess the failure depends on how many data pages the column spans.
I've found out the reason. I think it's not a bug in ByteStreamSplit; it's caused by SerializedPageReader reusing its decompression buffer without shrinking it for a smaller page (see the fix below).
…r resize smaller (#35428)

### Rationale for this change
See the issue.
1. SerializedPageReader reuses the same decompression buffer **and did not shrink it when the next page was smaller than the previous one**, so `buffer->size()` might not equal `page.uncompressed_size()`.
2. As a result, `data_size` in the decoder would equal the previous page's size.
3. BYTE_STREAM_SPLIT checks that the size is not too large, so it throws an exception. Other decoders show no bug because the uninitialized memory is never accessed.

### What changes are included in this PR?
Change the reader and add a test.

### Are these changes tested?
Yes

### Are there any user-facing changes?
No

**This PR contains a "Critical Fix".**
* Closes: #35423

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
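A schematic model of the mechanism the fix describes, as a minimal sketch in illustrative Python (not Arrow's actual code; all names here are invented for exposition):

```python
class PageReaderModel:
    """Toy stand-in for SerializedPageReader's buffer reuse (names invented)."""

    def __init__(self):
        self.decompression_buffer = bytearray()

    def buffer_for_page(self, uncompressed_size: int) -> bytearray:
        # The bug: the buffer only ever grows. A page smaller than its
        # predecessor is handed the old, larger buffer, so the apparent
        # data size is stale.
        if len(self.decompression_buffer) < uncompressed_size:
            self.decompression_buffer = bytearray(uncompressed_size)
        return self.decompression_buffer


def byte_stream_split_size_check(data_size: int, num_values: int, byte_width: int) -> None:
    # BYTE_STREAM_SPLIT validates the input size, which is what surfaced the
    # stale buffer; decoders that skip this check read fine because the
    # trailing uninitialized bytes are never touched.
    if data_size > num_values * byte_width:
        raise ValueError("Data size too large for number of values")


reader = PageReaderModel()
page1 = reader.buffer_for_page(1_048_576)  # full first page: OK
page2 = reader.buffer_for_page(4)          # small last page reuses the big buffer
try:
    byte_stream_split_size_check(len(page2), num_values=1, byte_width=4)
except ValueError as exc:
    print(exc)  # the stale, larger size trips the check
```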
### Describe the bug, including details regarding any error messages, version, and platform.
Arrow 12.0.0 has a regression where it can crash when reading byte-stream-split encoded data written by itself or by older versions of Arrow:
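A minimal reproduction sketch, assuming `pyarrow.parquet.write_table` with its `use_byte_stream_split` option (the file and column names here are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# 262,145 float32 values: one past the page boundary discussed above.
table = pa.table({"x": pa.array(range(262_145), type=pa.float32())})

pq.write_table(
    table,
    "repro.parquet",
    use_dictionary=False,        # force BYTE_STREAM_SPLIT instead of dictionary
    use_byte_stream_split=True,  # apply BYTE_STREAM_SPLIT to eligible columns
)

pq.read_table("repro.parquet")   # raises with pyarrow 12.0.0, per this report
```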
This crashes with the "Data size too large" error quoted in the issue title.
But the above code works fine with pyarrow 11.0.0 and 10.0.1.
It appears that #34140 caused this regression. I tested building pyarrow on the current main branch (commit 42d42b1) and could reproduce the error, but it was fixed after I reverted the merge of that PR (commit c31fb46).
### Component(s)

Parquet