
[Python] Error when reading table with list values from parquet #27474

Closed

asfimport opened this issue Feb 12, 2021 · 6 comments

@asfimport

I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.

Example code (pyarrow 2.0.0 and 3.0.0):

from pyarrow import parquet, Table

# 2**20 null rows followed by a single non-null list value.
data = [None] * (1 << 20)
data.append([1])

table = Table.from_arrays([data], ['column'])
print('Expected: %s' % table['column'][-1])

parquet.write_table(table, 'table.parquet')

table2 = parquet.read_table('table.parquet')
print('Actual:   %s' % table2['column'][-1])

Output:


Expected: [1]
Actual:   [0]

When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:


Expected: [1]
Actual:   [1]

For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.

It seems that this is caused by some overflow and memory corruption, because in pyarrow 3.0.0 with a more complex value (a list of dictionaries containing a float and a datetime):


from datetime import datetime
data.append([{'a': 0.1, 'b': datetime.now()}])

I'm getting this exception after calling table2.to_pandas():


/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool

 

Environment: Python 3.7
Reporter: Michal Glaus
Assignee: Micah Kornfield / @emkornfield


Note: This issue was originally created as ARROW-11607. Please see the migration documentation for further details.

@asfimport

Micah Kornfield / @emkornfield:
[~misogl] thank you for the report. Just out of curiosity, did you confirm that the issue isn't with writing?

@asfimport

Micah Kornfield / @emkornfield:
One observation is that this value lives directly on a chunking boundary (the number of chunks in the returned table is 2, with all the nulls in the prior chunk).
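
For illustration, here is a minimal sketch (not from the original report) of how to inspect that chunk layout after reading; it assumes the table.parquet file written by the repro above:

from pyarrow import parquet

table2 = parquet.read_table('table.parquet')
col = table2['column']  # a ChunkedArray
print(col.num_chunks)   # 2 in the failing case
print([len(chunk) for chunk in col.chunks])  # per the comment, the nulls all sit in the first chunk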

@asfimport

Michal Glaus:
@emkornfield I have tested multiple consecutive reads, and in version 2.0.0 I'm getting 0 up until this:


Read no. 176
Actual:   [0]
Read no. 177
Actual:   [4294967296]
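
For reference, a minimal sketch of such a repeated-read loop (not the reporter's exact script); it assumes the table.parquet file produced by the original repro:

from pyarrow import parquet

for i in range(200):
    table2 = parquet.read_table('table.parquet')
    print('Read no. %d' % (i + 1))
    print('Actual:   %s' % table2['column'][-1])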

I have discovered a possible workaround: setting row_group_size=100_000 in write_table. Any row_group_size up to 1 << 20 seems to fix the issue (at least for my test case).
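
A minimal sketch of that workaround (my reconstruction, following the report); the only change to the original repro is passing row_group_size to parquet.write_table:

from pyarrow import parquet, Table

data = [None] * (1 << 20)
data.append([1])
table = Table.from_arrays([data], ['column'])

# Smaller row groups avoid the bad read in this test case, per the report.
parquet.write_table(table, 'table.parquet', row_group_size=100_000)

table2 = parquet.read_table('table.parquet')
print('Actual:   %s' % table2['column'][-1])  # prints [1] with the workaround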

 

@asfimport

Micah Kornfield / @emkornfield:
So I now have a repro in C++. Unfortunately, most of our unit tests are written against an API that doesn't use RecordBatchReader, and thus this edge case wasn't caught there.
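
For context, a rough sketch (mine, not from the thread) of exercising the batch-based read path from Python via pyarrow's ParquetFile.iter_batches; whether this goes through the same RecordBatchReader code path as the C++ repro is an assumption on my part:

from pyarrow import parquet

pf = parquet.ParquetFile('table.parquet')
# Stream the file as record batches instead of reading one whole Table.
batches = list(pf.iter_batches(batch_size=1 << 16))
last_batch = batches[-1]
print(last_batch.column(0)[-1])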

@asfimport

Micah Kornfield / @emkornfield:
It looks like this is indeed a bug on the read side, and in some cases I could see how this might cause corruption.

 

The internal error might be independent. Here is the code that could throw that error:

std::unique_ptr<MemoryPool> MemoryPool::CreateDefault() {

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request #9498

asfimport added this to the 4.0.0 milestone Jan 11, 2023