
[Python] arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize #23190

Closed
asfimport opened this issue Oct 11, 2019 · 5 comments


@asfimport
Collaborator

I'll need to jump through hoops to upload the (seemingly valid) Parquet file that triggers this bug. In the meantime, here's the error I get when reading the Parquet file with read_dictionary=true. I'll start with the stack trace:

Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

#0 0x0000000000b9fffd in __cxa_throw ()
#1 0x00000000004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow (this=0x555556612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 '\377' <repeats 200 times>..., valid_bits_offset=748544, builder=0x555556616330) at /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886
#2 0x000000000046d703 in parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced (this=0x555556616260, values_to_read=67339, null_count=0) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314
#3 0x00000000004a13f8 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecordData (this=0x555556616260, num_records=67339) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096
#4 0x0000000000493876 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecords (this=0x555556616260, num_records=815883) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875
#5 0x0000000000413955 in parquet::arrow::LeafReader::NextBatch (this=0x555556615640, records_to_read=815883, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413
#6 0x0000000000412081 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x5555566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218
#7 0x00000000004121b0 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x5555566067a0, i=7, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223
#8 0x0000000000405fbd in readParquet(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) ()

And now a report of my gdb adventures:

In Arrow 0.15.0, when reading a particular dictionary column (read_dictionary=true) with 815883 rows that was written by Arrow 0.14.1, arrow::Dictionary32Builder<arrow::BinaryType>::AppendIndices(...) is called twice (once with 493568 values, once with 254976 values), and then PlainByteArrayDecoder::DecodeArrow() is called. (I'm a novice; I don't know why this column comes in three batches.) On the first AppendIndices() call, the buffer capacity is equal to the number of values. On the second call, that's no longer the case: the buffer grows using BufferBuilder::GrowByFactor, so its capacity becomes 987136.

But there's a bug: the 987136-capacity buffer is in Dictionary32Builder::indices_builder_; so 987136 is stored in Dictionary32Builder::indices_builder_.capacity_. Dictionary32Builder::capacity_ does not change when AppendIndices() is called. (Dictionary32Builder behaves like a proxy for its indices_builder_; but its capacity() method is not virtual, so things are messy.)

So builder.capacity_ is still 0. Then comes the final batch of 67339 values, via DecodeArrow(). That calls builder->Reserve(num_values), which tries to increase the capacity from 0 (its wrong, cached value) to length_ + num_values (815883). Since indices_builder_->capacity_ is 987136, that resize is a downsize – which throws the exception above.
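To make the numbers concrete, here is a back-of-the-envelope model of the capacities involved. This is illustrative Python only, not the actual Arrow C++ code, and the doubling behaviour is my reading of BufferBuilder::GrowByFactor:

```python
# Illustrative arithmetic only -- a model of the capacities described above,
# not the actual Arrow C++ code. Batch sizes are taken from the gdb session.
batch1, batch2, batch3 = 493568, 254976, 67339
total_rows = batch1 + batch2 + batch3            # 815883

# After the first AppendIndices(), the indices buffer capacity equals batch1.
indices_capacity = batch1                        # 493568

# The second AppendIndices() needs batch1 + batch2 = 748544 slots;
# GrowByFactor doubles the capacity:
indices_capacity = max(indices_capacity * 2, batch1 + batch2)   # 987136

# Meanwhile Dictionary32Builder::capacity_ was never updated:
dict_builder_capacity = 0

# DecodeArrow() calls builder->Reserve(batch3). Working from the stale
# capacity_ of 0, Reserve asks to resize the indices buffer to
# length_ + num_values = 815883, which is smaller than its real capacity
# of 987136 -- hence "Resize cannot downsize".
requested_capacity = (batch1 + batch2) + batch3  # 815883
assert requested_capacity < indices_capacity
print(f"resize {indices_capacity} -> {requested_capacity}: downsize")
```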

The only workaround I can find: use read_dictionaries=false.

This affects Python, too.
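For the Python side, here is a minimal pyarrow sketch of the failing read and the workaround. The file name and column name are placeholders standing in for the attached file and its affected column:

```python
import pyarrow.parquet as pq

PATH = "written-by-arrow-0.14.1.parquet"   # placeholder for the attached file

# Fails on pyarrow 0.15.0 with "Resize cannot downsize" when the affected
# column is decoded directly to a dictionary ("col8" is a placeholder name):
table = pq.read_table(PATH, read_dictionary=["col8"])

# Workaround: leave read_dictionary unset, so the column is read as a plain
# string column instead of a dictionary column.
table = pq.read_table(PATH)
```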

I've attached a patch that fixes the issue for my file. I don't know how to formulate a reduction, though, so I haven't contributed unit tests. I'm also not certain how FinishInternal is meant to work, so this definitely needs expert review. (FinishInternal was definitely buggy before my patch; after the patch it may still be, but I can't tell.)

Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
Reporter: Adam Hooper / @adamhooper
Assignee: Wes McKinney / @wesm


Note: This issue was originally created as ARROW-6861. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Adam Hooper / @adamhooper:
I've attached a Parquet file, written by Arrow 0.14.1, which causes this problem. Column 8 (among others) causes this problem. Most columns work fine.

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Thanks. This should be enough information to help write a unit test to reproduce the issue. @bkietz are you interested in taking a look?

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Seems like a good candidate for 0.15.1. Marked as such

@asfimport
Collaborator Author

Wes McKinney / @wesm:
I started looking at this

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Issue resolved by pull request #5643

@asfimport asfimport added this to the 0.15.1 milestone Jan 11, 2023