[Python] pyarrow.csv.read_csv hangs + eats all RAM #22212

Closed
asfimport opened this issue Jun 29, 2019 · 9 comments

@asfimport

I have quite a sparse dataset in CSV format: a wide table with only a few rows but many (~32k) columns, about 540 KB in total.

When I read the dataset using pyarrow.csv.read_csv, it hangs, gradually consumes all memory, and the process gets killed.

More details on the conditions below. The script to run and all the files mentioned are in the attachments.

  1. sample_32769_cols.csv is the dataset that suffers from the problem.

  2. sample_32768_cols.csv is the dataset that DOES NOT suffer from the problem and is read in under 400 ms on my machine. It is the same dataset minus ONE final column. That last column is no different from the others and contains only empty values.

I have no idea why exactly this one column makes the difference between a proper run and a hanging failure that looks like some kind of memory leak. A minimal sketch of the repro follows.
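This is a hedged sketch of the kind of script included in the attachments, not the attached script itself; 'sample_wide.csv' is a stand-in name for the attached samples:

# Repro sketch: generates a wide, mostly-empty CSV with 32769 columns
# and a few rows, then reads it back with pyarrow.csv.read_csv, which
# is where the hang occurs.
import pyarrow.csv

NUM_COLS = 32769  # the 32768-column variant reads fine

header = ','.join('c%d' % i for i in range(NUM_COLS))
empty_row = ',' * (NUM_COLS - 1)  # a row of all-empty values
with open('sample_wide.csv', 'w') as f:
    f.write(header + '\n')
    for _ in range(5):
        f.write(empty_row + '\n')

table = pyarrow.csv.read_csv('sample_wide.csv')  # hangs and eats RAM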

I have created a flame graph for case (1) to help with resolving this issue (graph.svg).

 

Environment: Ubuntu Xenial, Python 2.7
Reporter: Bogdan Klichuk
Assignee: Micah Kornfield / @emkornfield

Note: This issue was originally created as ARROW-5791. Please see the migration documentation for further details.

@asfimport

Brian Hulette / @TheNeuralBit:
Thanks for the concise bug report! I haven't had a chance to dig into this very far, but I'm sure it's no coincidence that 32768 == 2^15. 32767 is the max of a signed 16-bit integer, so if we're assigning a signed int16 index to each column somewhere, it would overflow once you go beyond 32768 columns (since the first column gets index 0).

I'm not sure where exactly that would be happening, though. My first inclination was the element count for the vector of fields, but according to the Flatbuffers documentation, vectors are prefixed by a 32-bit element count.
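To make the suspected wraparound concrete, here is a small sketch. It is plain Python modeling signed 16-bit arithmetic; Arrow's internals are C++, and this is not its code:

def as_int16(n):
    # Reinterpret n modulo 2**16 as a signed 16-bit value.
    n &= 0xFFFF
    return n - 0x10000 if n >= 0x8000 else n

print(as_int16(32767))  # 32767: the 32768th column (index 32767) still fits
print(as_int16(32768))  # -32768: a 32769th column's index wraps negative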

@asfimport

Bogdan Klichuk:
@TheNeuralBit It's a shame I dismissed the idea that "maybe this is just a power of 2" and didn't simply try it. Great point.

@asfimport

Bogdan Klichuk:
Just to note: I can successfully convert a dataframe (if I read it using pandas) to a pyarrow.Table directly.

import pandas
import pyarrow

df = pandas.read_csv('...', ...)  # reading the same CSV via pandas works
table = pyarrow.Table.from_pandas(df)

@asfimport

Wes McKinney / @wesm:
Evidently there's an int16 overflow somewhere in the Arrow CSV codebase. An initial grep didn't turn up anything obvious.

Comparisons with other CSV libraries (like pandas) probably are not relevant since there is no code overlap.

@asfimport

Micah Kornfield / @emkornfield:
I think I know where this is occurring; I will try to patch it tonight.

Do we want to support more columns or throw an error?

@asfimport

Wes McKinney / @wesm:
I think we should support more columns.

@asfimport

Micah Kornfield / @emkornfield:
Made a PR; this wasn't really an overflow. Also added a fixed cap of 1000*1024 columns, which should be enough for anyone :)
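As a hedged illustration of what such a cap means in practice: the 1000*1024 figure is from the comment above, but the constant and function names below are hypothetical, not Arrow's actual code:

MAX_NUM_COLUMNS = 1000 * 1024  # figure quoted above; hypothetical name

def check_column_count(num_columns):
    # Fail loudly on absurdly wide inputs instead of misbehaving.
    if num_columns > MAX_NUM_COLUMNS:
        raise ValueError('CSV has %d columns; the cap is %d'
                         % (num_columns, MAX_NUM_COLUMNS))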

@asfimport

Wes McKinney / @wesm:
Issue resolved by pull request #4762.

@asfimport

Bogdan Klichuk:
Thanks a lot! 

asfimport added this to the 0.14.0 milestone on Jan 11, 2023