
[Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize #23190

Closed
asfimport opened this issue Oct 11, 2019 · 5 comments


@asfimport
Collaborator

I'll need to jump through hoops to upload the (seemingly valid) Parquet file that triggers this bug. In the meantime, here's the error I get when reading the Parquet file with read_dictionary=true. I'll start with the stack trace:

Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

#0 0x0000000000b9fffd in __cxa_throw ()
#1 0x00000000004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow (this=0x555556612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 '\377' <repeats 200 times>..., valid_bits_offset=748544, builder=0x555556616330) at /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886
#2 0x000000000046d703 in parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced (this=0x555556616260, values_to_read=67339, null_count=0) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314
#3 0x00000000004a13f8 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecordData (this=0x555556616260, num_records=67339) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096
#4 0x0000000000493876 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecords (this=0x555556616260, num_records=815883) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875
#5 0x0000000000413955 in parquet::arrow::LeafReader::NextBatch (this=0x555556615640, records_to_read=815883, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413
#6 0x0000000000412081 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x5555566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218
#7 0x00000000004121b0 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x5555566067a0, i=7, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223
#8 0x0000000000405fbd in readParquet(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) ()

And now a report of my gdb adventures:

In Arrow 0.15.0, when reading a particular dictionary column (read_dictionary=true) with 815883 rows that was written by Arrow 0.14.1, arrow::Dictionary32Builder<arrow::BinaryType>::AppendIndices(...) is called twice (once with 493568 values, once with 254976 values), and then PlainByteArrayDecoder::DecodeArrow() is called. (I'm a novice; I don't know why this column comes in three batches.) On the first AppendIndices() call, the buffer capacity is equal to the number of values. On the second call, that's no longer the case: the buffer grows using BufferBuilder::GrowByFactor, so its capacity becomes 987136.

But there's a bug: the 987136-capacity buffer is in Dictionary32Builder::indices_builder_; so 987136 is stored in Dictionary32Builder::indices_builder_.capacity_. Dictionary32Builder::capacity_ does not change when AppendIndices() is called. (Dictionary32Builder behaves like a proxy for its indices_builder_; but its capacity() method is not virtual, so things are messy.)

So builder.capacity_ is still 0. Then comes the final batch of 67339 values, via DecodeArrow(). That calls builder->Reserve(num_values), which tries to increase the capacity from 0 (its wrong, cached value) to length_ + num_values (815883). Since indices_builder_->capacity_ is 987136, that resize is a downsize – which throws the exception above.
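To make the numbers concrete, here is a back-of-the-envelope model of the capacities involved. This is illustrative Python only, not the actual Arrow C++ code, and the doubling behaviour is my reading of BufferBuilder::GrowByFactor:

```python
# Illustrative arithmetic only -- a model of the capacities described above,
# not the actual Arrow C++ code. Batch sizes are taken from the gdb session.
batch1, batch2, batch3 = 493568, 254976, 67339
total_rows = batch1 + batch2 + batch3            # 815883

# After the first AppendIndices(), the indices buffer capacity equals batch1.
indices_capacity = batch1                        # 493568

# The second AppendIndices() needs batch1 + batch2 = 748544 slots;
# GrowByFactor doubles the capacity:
indices_capacity = max(indices_capacity * 2, batch1 + batch2)   # 987136

# Meanwhile Dictionary32Builder::capacity_ was never updated:
dict_builder_capacity = 0

# DecodeArrow() calls builder->Reserve(batch3). Working from the stale
# capacity_ of 0, Reserve asks to resize the indices buffer to
# length_ + num_values = 815883, which is smaller than its real capacity
# of 987136 -- hence "Resize cannot downsize".
requested_capacity = (batch1 + batch2) + batch3  # 815883
assert requested_capacity < indices_capacity
print(f"resize {indices_capacity} -> {requested_capacity}: downsize")
```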

The only workaround I can find: use read_dictionaries=false.

This affects Python, too.
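For the Python side, here is a minimal pyarrow sketch of the failing read and the workaround. The file name and column name are placeholders standing in for the attached file and its affected column:

```python
import pyarrow.parquet as pq

PATH = "written-by-arrow-0.14.1.parquet"   # placeholder for the attached file

# Fails on pyarrow 0.15.0 with "Resize cannot downsize" when the affected
# column is decoded directly to a dictionary ("col8" is a placeholder name):
table = pq.read_table(PATH, read_dictionary=["col8"])

# Workaround: leave read_dictionary unset, so the column is read as a plain
# string column instead of a dictionary column.
table = pq.read_table(PATH)
```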

I've attached a patch that fixes the issue for my file. I don't know how to formulate a reduction, though, so I haven't contributed unit tests. I'm also not certain how FinishInternal is meant to work, so this definitely needs expert review. (FinishInternal was definitely buggy before my patch; after the patch it may still be, but I can't tell.)

Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
Reporter: Adam Hooper / @adamhooper
Assignee: Wes McKinney / @wesm


Note: This issue was originally created as ARROW-6861. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Adam Hooper / @adamhooper:
I've attached a Parquet file, written by Arrow 0.14.1, which causes this problem. Column 8 (among others) causes this problem. Most columns work fine.

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Thanks. This should be enough information to help write a unit test to reproduce the issue. @bkietz are you interested in taking a look?

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Seems like a good candidate for 0.15.1. Marked as such

@asfimport
Collaborator Author

Wes McKinney / @wesm:
I started looking at this

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Issue resolved by pull request #5643

@asfimport asfimport added this to the 0.15.1 milestone Jan 11, 2023