When I decrease the number of rows by 1 (by using (1 << 20) - 1), I get:
Expected: [1]
Actual: [1]
For pyarrow 1.0.1 and 1.0.0, the threshold number of rows is 1 << 15.
It seems that this is caused by some overflow and memory corruption, because in pyarrow 3.0.0 with more complex values (a list of dictionaries with a float and a datetime):
data.append([{'a': 0.1, 'b': datetime.now()}])
I'm getting this exception after calling table2.to_pandas():
/arrow/cpp/src/arrow/memory_pool.cc:501: Internal error: cannot create default memory pool
Micah Kornfield / @emkornfield:
One observation is that this value lives directly on a chunking boundary (the num chunks in the returned table is 2, with all the nulls in the prior chunk).
I have discovered a possible workaround: setting row_group_size=100_000 in write_table. Any value up to 1 << 20 seems to fix the issue (at least for my test case).
Micah Kornfield / @emkornfield:
So I now have a repro in C++. Unfortunately, most of our unit tests are written against an API that doesn't use RecordBatchReader, so this edge case wasn't caught there.
Micah Kornfield / @emkornfield:
It looks like this is indeed a bug on the read side, and in some cases I could see how it might cause corruption.
The internal error might be independent. Here is the code that could throw that error:
I'm getting unexpected results when reading tables containing list values and a large number of rows from a parquet file.
Example code (pyarrow 2.0.0 and 3.0.0):
Output:
Environment: Python 3.7
Reporter: Michal Glaus
Assignee: Micah Kornfield / @emkornfield
PRs and other links:
Note: This issue was originally created as ARROW-11607. Please see the migration documentation for further details.