[Python] pyarrow.csv.read_csv hangs + eats all RAM #22212

Closed
asfimport opened this issue Jun 29, 2019 · 9 comments

@asfimport

I have quite a sparse dataset in CSV format: a wide table with only a few rows but many (~32k) columns, about 540 KB in total.

When I read the dataset using pyarrow.csv.read_csv, it hangs, gradually consumes all memory, and the process gets killed.

More details on the conditions below. The script to run and all the files mentioned are in the attachments.

  1. sample_32769_cols.csv is the dataset that suffers from the problem.

  2. sample_32768_cols.csv is the dataset that DOES NOT suffer from the problem and is read in under 400 ms on my machine. It is the same dataset minus ONE final column. That last column is no different from the others and contains only empty values.

I have no idea why exactly this one column makes the difference between a proper run and a hanging failure that looks like some kind of memory leak. A minimal sketch of the repro follows.
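This is a hedged sketch of the kind of script included in the attachments, not the attached script itself; 'sample_wide.csv' is a stand-in name for the attached samples:

# Repro sketch: generates a wide, mostly-empty CSV with 32769 columns
# and a few rows, then reads it back with pyarrow.csv.read_csv, which
# is where the hang occurs.
import pyarrow.csv

NUM_COLS = 32769  # the 32768-column variant reads fine

header = ','.join('c%d' % i for i in range(NUM_COLS))
empty_row = ',' * (NUM_COLS - 1)  # a row of all-empty values
with open('sample_wide.csv', 'w') as f:
    f.write(header + '\n')
    for _ in range(5):
        f.write(empty_row + '\n')

table = pyarrow.csv.read_csv('sample_wide.csv')  # hangs and eats RAM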

I have created a flame graph for case (1) to help with resolving this issue (graph.svg).

 

Environment: Ubuntu Xenial, Python 2.7
Reporter: Bogdan Klichuk
Assignee: Micah Kornfield / @emkornfield

Note: This issue was originally created as ARROW-5791. Please see the migration documentation for further details.

@asfimport

Brian Hulette / @TheNeuralBit:
Thanks for the concise bug report! I haven't had a chance to dig into this very far, but I'm sure it's no coincidence that 32768 == 2^15. 32767 is the max of a signed 16-bit integer, so if we're assigning a signed int16 index to each column somewhere, it would overflow once you go beyond 32768 columns (since the first column gets index 0).

I'm not sure where exactly that would be happening, though. My first inclination was the element count for the vector of fields, but according to the Flatbuffers documentation, vectors are prefixed by a 32-bit element count.
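To make the suspected wraparound concrete, here is a small sketch. It is plain Python modeling signed 16-bit arithmetic; Arrow's internals are C++, and this is not its code:

def as_int16(n):
    # Reinterpret n modulo 2**16 as a signed 16-bit value.
    n &= 0xFFFF
    return n - 0x10000 if n >= 0x8000 else n

print(as_int16(32767))  # 32767: the 32768th column (index 32767) still fits
print(as_int16(32768))  # -32768: a 32769th column's index wraps negative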

@asfimport

Bogdan Klichuk:
@TheNeuralBit It's a shame I dismissed the idea that "maybe this is just a power of 2" and didn't simply try it. Great point.

@asfimport

Bogdan Klichuk:
Just to note: I can successfully convert a dataframe (if I read it using pandas) to a pyarrow.Table directly.

import pandas
import pyarrow

df = pandas.read_csv('...', ...)  # reading the same CSV via pandas works
table = pyarrow.Table.from_pandas(df)

@asfimport

Wes McKinney / @wesm:
Evidently there's an int16 overflow somewhere in the Arrow CSV codebase. An initial grep didn't turn up anything obvious.

Comparisons with other CSV libraries (like pandas) probably are not relevant since there is no code overlap.

@asfimport

Micah Kornfield / @emkornfield:
I think I know where this is occurring; I will try to patch it tonight.

Do we want to support more columns or throw an error?

@asfimport

Wes McKinney / @wesm:
I think we should support more columns.

@asfimport

Micah Kornfield / @emkornfield:
Made a PR; this wasn't really an overflow. Also added a fixed cap of 1000*1024 columns, which should be enough for anyone :)
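As a hedged illustration of what such a cap means in practice: the 1000*1024 figure is from the comment above, but the constant and function names below are hypothetical, not Arrow's actual code:

MAX_NUM_COLUMNS = 1000 * 1024  # figure quoted above; hypothetical name

def check_column_count(num_columns):
    # Fail loudly on absurdly wide inputs instead of misbehaving.
    if num_columns > MAX_NUM_COLUMNS:
        raise ValueError('CSV has %d columns; the cap is %d'
                         % (num_columns, MAX_NUM_COLUMNS))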

@asfimport

Wes McKinney / @wesm:
Issue resolved by pull request #4762.

@asfimport

Bogdan Klichuk:
Thanks a lot! 

asfimport added this to the 0.14.0 milestone on Jan 11, 2023