[Python] pyarrow.csv.read_csv hangs + eats all RAM #22212
Comments
Brian Hulette / @TheNeuralBit: I'm not sure where exactly that would be happening though. My first inclination was that it would be in the element count for the vector of fields, but according to the flatbuffers page, vectors are prefixed by a 32-bit element count.
Bogdan Klichuk:
df = pandas.read_csv('...', ...)
table = pyarrow.Table.from_pandas(df)
Wes McKinney / @wesm: Comparisons with other CSV libraries (like pandas) are probably not relevant since there is no code overlap.
Micah Kornfield / @emkornfield: Do we want to support more columns or throw an error?
I have quite a sparse dataset in CSV format. It is a wide table with only a few rows but many (32k) columns. Total size is ~540K.
When I read the dataset using pyarrow.csv.read_csv, it hangs, gradually eats all memory, and gets killed.
More details on the conditions follow. The script to run and all mentioned files are under attachments.
1. sample_32769_cols.csv is the dataset that suffers from the problem.
2. sample_32768_cols.csv is the dataset that DOES NOT suffer and is read in under 400ms on my machine. It is the same dataset without ONE last column. That last column is no different from the others and has empty values.
I have no idea why exactly this one column makes the difference between a normal read and a hang that looks like a memory leak.
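For readers without access to the attachments, the two cases can be approximated with generated files. The following is only a rough sketch, not the attached script: the helper names `write_wide_csv` and `read_with_pyarrow`, the column names, and the row count are all hypothetical, the generated cells are entirely empty rather than sparse, and it is written for Python 3 rather than the Python 2.7 environment reported here. It assumes the column count alone (32768 vs. 32769) is what triggers the behavior, which the report suggests but does not prove.

```python
import csv


def write_wide_csv(path, n_cols, n_rows=3):
    """Write a CSV with n_cols columns and n_rows rows of empty cells.

    Hypothetical stand-in for the attached sample files.
    """
    header = ["c%d" % i for i in range(n_cols)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(n_rows):
            writer.writerow([""] * n_cols)
    return path


def read_with_pyarrow(path):
    """Read the file with pyarrow.csv.read_csv and return its column count."""
    # Imported lazily so the generator above works without pyarrow installed.
    import pyarrow.csv as pacsv

    table = pacsv.read_csv(path)
    return table.num_columns
```

To compare the two cases, one would call `write_wide_csv("wide_32768.csv", 32768)` and `write_wide_csv("wide_32769.csv", 32769)`, then pass each path to `read_with_pyarrow`. On an affected pyarrow version the second read may not return, so it is safer to run it in a separate process with a memory limit or be ready to interrupt it.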
I have created a flame graph for case (1) to support the resolution of this issue (graph.svg).
Environment: Ubuntu Xenial, Python 2.7
Reporter: Bogdan Klichuk
Assignee: Micah Kornfield / @emkornfield
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-5791. Please see the migration documentation for further details.