
[Python] Error when reading table with list values from parquet #27474

Closed

asfimport opened this issue Feb 12, 2021 · 6 comments

@asfimport

I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.

Example code (pyarrow 2.0.0 and 3.0.0):

from pyarrow import parquet, Table

# 2**20 null rows followed by a single non-null list value.
data = [None] * (1 << 20)
data.append([1])

table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])

parquet.write_table(table, 'table.parquet')

table2 = parquet.read_table('table.parquet')
print('Actual:   %s' % table2['column'][-1])

Output:


Expected: [1]
Actual:   [0]

When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:


Expected: [1]
Actual:   [1]

For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.

It seems that this is caused by some overflow and memory corruption, because in pyarrow 3.0.0 with a more complex value (a list of dictionaries containing a float and a datetime):


from datetime import datetime
data.append([{'a': 0.1, 'b': datetime.now()}])

I'm getting this exception after calling table2.to_pandas():


/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool

 

Environment: Python 3.7
Reporter: Michal Glaus
Assignee: Micah Kornfield / @emkornfield


Note: This issue was originally created as ARROW-11607. Please see the migration documentation for further details.

@asfimport

Micah Kornfield / @emkornfield:
[~misogl] thank you for the report. Just out of curiosity, did you confirm that the issue isn't with writing?

@asfimport

Micah Kornfield / @emkornfield:
One observation is that this value lives directly on a chunking boundary (the number of chunks in the returned table is 2, with all the nulls in the prior chunk).
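
For illustration, here is a minimal sketch (not from the original report) of how to inspect that chunk layout after reading; it assumes the table.parquet file written by the repro above:

from pyarrow import parquet

table2 = parquet.read_table('table.parquet')
col = table2['column']  # a ChunkedArray
print(col.num_chunks)   # 2 in the failing case
print([len(chunk) for chunk in col.chunks])  # per the comment, the nulls all sit in the first chunk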

@asfimport

Michal Glaus:
@emkornfield I have tested multiple consecutive reads, and in version 2.0.0 I'm getting 0 up until this:


Read no. 176
Actual:   [0]
Read no. 177
Actual:   [4294967296]
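
For reference, a minimal sketch of such a repeated-read loop (not the reporter's exact script); it assumes the table.parquet file produced by the original repro:

from pyarrow import parquet

for i in range(200):
    table2 = parquet.read_table('table.parquet')
    print('Read no. %d' % (i + 1))
    print('Actual:   %s' % table2['column'][-1])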

I have discovered a possible workaround: setting row_group_size=100_000 in write_table. Any row_group_size up to 1 << 20 seems to fix the issue (at least for my test case).
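
A minimal sketch of that workaround (my reconstruction, following the report); the only change to the original repro is passing row_group_size to parquet.write_table:

from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([1])
table = Table.from_arrays([data], ['column'])

# Smaller row groups avoid the bad read in this test case, per the report.
parquet.write_table(table, 'table.parquet', row_group_size=100_000)

table2 = parquet.read_table('table.parquet')
print('Actual:   %s' % table2['column'][-1])  # prints [1] with the workaround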

 

@asfimport

Micah Kornfield / @emkornfield:
So I now have a repro in C++. Unfortunately, most of our unit tests are written against an API that doesn't use RecordBatchReader, and thus this edge case wasn't caught there.
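
For context, a rough sketch (mine, not from the thread) of exercising the batch-based read path from Python via pyarrow's ParquetFile.iter_batches; whether this goes through the same RecordBatchReader code path as the C++ repro is an assumption on my part:

from pyarrow import parquet

pf = parquet.ParquetFile('table.parquet')
# Stream the file as record batches instead of reading one whole Table.
batches = list(pf.iter_batches(batch_size=1 << 16))
last_batch = batches[-1]
print(last_batch.column(0)[-1])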

@asfimport

Micah Kornfield / @emkornfield:
It looks like this is indeed a bug on the read side, and in some cases I could see how this might cause corruption.

 

The internal error might be independent. Here is the code that could throw that error:

std::unique_ptr<MemoryPool> MemoryPool::CreateDefault() {

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request #9498

asfimport added this to the 4.0.0 milestone Jan 11, 2023