
python hangs after write a few parquet tables #17324

Closed
asfimport opened this issue Aug 1, 2017 · 16 comments

I had a program that read some csv files (a few million rows each, 9 columns) and converted them with:

import os
import pandas as pd

import pyarrow.parquet as pq
import pyarrow

def to_parquet(output_file, csv_file):
    df = pd.read_csv(csv_file)
    df['gecco_variant'] = [ v.lstrip('0') for v in df['gecco_variant']]
    table = pyarrow.Table.from_pandas(df)
    pq.write_table(table, output_file)

The first csv file would always complete, but python would hang on the second or third file, and sometimes on a much later file.
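
For illustration, the conversion was presumably driven by a loop along these lines; the directory and file names here are hypothetical, not taken from the original report:

# Hypothetical driver loop (illustrative names): convert each CSV file to a
# matching Parquet file using the to_parquet() function defined above.
import glob

for csv_file in sorted(glob.glob('csv_data/*.csv')):
    output_file = csv_file.replace('.csv', '.parquet')
    to_parquet(output_file, csv_file)  # hangs on the 2nd or 3rd file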

Environment: Python 3.5.2, pyarrow 0.5.0
Reporter: Keith Curtis
Assignee: Wes McKinney / @wesm

Original Issue Attachments:

Note: This issue was originally created as ARROW-1311. Please see the migration documentation for further details.

Wes McKinney / @wesm:
hi [~K94] – on what platform, and how did you install? Seems this may be ARROW-1282.

@xhochy I think we need to get patched builds out ASAP. Let me know how you want to proceed

Uwe Korn / @xhochy:
@wesm We should simply disable jemalloc by default until these problems have been resolved. I will try to reproduce locally and then talk to the jemalloc people to get it fixed upstream.
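
For context, a possible user-side workaround would be to route allocations away from jemalloc; this is a minimal sketch, assuming the memory-pool selection API below is available in the installed build (it exists in later pyarrow releases, but may not be exposed in 0.5.0):

# Sketch of a workaround, not from this thread: use the system allocator
# instead of jemalloc for Arrow allocations. Assumes pyarrow.set_memory_pool()
# and pyarrow.system_memory_pool() are available in the installed version.
import pyarrow

pyarrow.set_memory_pool(pyarrow.system_memory_pool())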

Wes McKinney / @wesm:
We could release patched builds on PyPI, but there is also the performance regression ARROW-1290. I may update 0.5.0 on conda-forge to include this patch and disable jemalloc for now.

Keith Curtis:
Stack trace from gdb when Python appeared to be hung.

Keith Curtis:
I re-ran my code with a revised function; I added a line that updates a column, which seems to matter.

def to_parquet(output_file, csv_file):
    df = pd.read_csv(csv_file)
    df['gecco_variant'] = [v.lstrip('0') for v in df['gecco_variant']]
    table = pyarrow.Table.from_pandas(df)
    pq.write_table(table, output_file)

When Python seemed hung (no progress after 3 minutes), I captured a stack trace with gdb and attached the file.

I'm running on Ubuntu 14.04.3. I installed into a conda virtual environment using pip.

Wes McKinney / @wesm:
Thanks, indeed this is ARROW-1282. I'm in the process of updating 0.5.0 binaries to disable the jemalloc allocator.

If you are using pip, can you try pip install pyarrow==0.5.* which should pull the 0.5.0.post1 updated build? If you are using conda, it will take me a little while to update the binaries on conda-forge.
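
A quick way to confirm which build actually got installed (a minimal check; the post-release suffix should appear in the version string if the patched wheel was picked up):

# Print the installed pyarrow version; expect something like 0.5.0.post1 here
# if the updated wheel was installed.
import pyarrow

print(pyarrow.__version__)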

Wes McKinney / @wesm:
Same issue as ARROW-1282

Wes McKinney / @wesm:
Actually, I made a mistake in the build, and need to post another one, hang on for a few minutes.

Wes McKinney / @wesm:
Should be all set now with 0.5.0.post2

Keith Curtis:
Hi, I think I have the updated one:
$ pip install --upgrade pyarrow==0.5.*
Collecting pyarrow==0.5.*
Downloading pyarrow-0.5.0.post1-cp35-cp35m-manylinux1_x86_64.whl (8.9MB)
...

I re-ran my script, but python appeared to hang, and the stack trace looks similar:

#0 je_spin_adaptive (spin=) at include/jemalloc/internal/spin.h:40
#1 chunk_dss_max_update (new_addr=) at src/chunk_dss.c:83
#2 je_chunk_alloc_dss (tsdn=tsdn@entry=0x7f6d609ab620, arena=arena@entry=0x7f6ca8800140, new_addr=new_addr@entry=0x7f6c33000000, size=size@entry=8388608,
alignment=alignment@entry=2097152, zero=zero@entry=0x7fff45db9850, commit=commit@entry=0x7fff45db97a0) at src/chunk_dss.c:122
#3 0x00007f6ca92bb02f in chunk_alloc_core (dss_prec=dss_prec_secondary, commit=0x7fff45db97a0, zero=0x7fff45db9850, alignment=2097152, size=8388608, new_addr=0x7f6c33000000,
arena=0x7f6ca8800140, tsdn=0x7f6d609ab620) at src/chunk.c:357
#4 chunk_alloc_default_impl (commit=0x7fff45db97a0, zero=0x7fff45db9850, alignment=2097152, size=8388608, new_addr=0x7f6c33000000, arena=0x7f6ca8800140, tsdn=0x7f6d609ab620)
at src/chunk.c:430
#5 je_chunk_alloc_wrapper (tsdn=tsdn@entry=0x7f6d609ab620, arena=arena@entry=0x7f6ca8800140, chunk_hooks=chunk_hooks@entry=0x7fff45db97c0, new_addr=new_addr@entry=0x7f6c33000000,
size=size@entry=8388608, alignment=2097152, sn=sn@entry=0x7fff45db97b0, zero=zero@entry=0x7fff45db9850, commit=commit@entry=0x7fff45db97a0) at src/chunk.c:490
...

Keith Curtis:
Ok, I'll re-try with post2

Keith Curtis:
I re-ran my script with pyarrow-0.5.0.post2; that seemed to fix it, and my script ran smoothly, converting 22 csv files to parquet format. Thanks!

Wes McKinney / @wesm:
Cool, thank you! And very sorry about the trouble. We would have learned about these problems with jemalloc earlier, but we only made it the default allocator in 0.5.0, so it's good to know about them so we can work with the jemalloc developers to figure out what's wrong.

Uwe Korn / @xhochy:
Two things I see in this BT:

  • The size change is quite small: size=2097216, oldsize=2097152
  • We're in the area of 2 GiB of allocated memory and want to expand the region by a page or so, not do a full re-allocation.

[~K94] It would be nice if you could try to run into the problematic situation again and get a more detailed backtrace with thread apply all bt full instead of bt in gdb. This would help me better understand the problem. Sadly, I have not yet been able to reproduce this locally.
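
For reference, capturing that kind of backtrace from the hung process looks roughly like this, where <pid> is the process ID of the stuck Python interpreter:

$ gdb -p <pid>
(gdb) thread apply all bt full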

Keith Curtis:
Okay, to get the new backtrace (attached), I had to install the 0.5.0 version from conda:
pyarrow: 0.5.0-np112py35_0 conda-forge

I see there are a lot of threads in there (64?), more than I expected. I ran it from the IPython qtconsole; maybe that has something to do with it. Hope that helps.

Todd Farmer / @toddfarmer:
Transitioning issue from Resolved to Closed based on resolution field value.
