
python hangs after write a few parquet tables #17324

Closed
asfimport opened this issue Aug 1, 2017 · 16 comments

I had a program that read some csv files (a few million rows each, 9 columns) and converted them with:

import os
import pandas as pd

import pyarrow.parquet as pq
import pyarrow

def to_parquet(output_file, csv_file):
    df = pd.read_csv(csv_file)
    df['gecco_variant'] = [ v.lstrip('0') for v in df['gecco_variant']]
    table = pyarrow.Table.from_pandas(df)
    pq.write_table(table, output_file)

The first csv file would always complete, but python would hang on the second or third file, and sometimes on a much later file.
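
For illustration, the conversion was presumably driven by a loop along these lines; the directory and file names here are hypothetical, not taken from the original report:

# Hypothetical driver loop (illustrative names): convert each CSV file to a
# matching Parquet file using the to_parquet() function defined above.
import glob

for csv_file in sorted(glob.glob('csv_data/*.csv')):
    output_file = csv_file.replace('.csv', '.parquet')
    to_parquet(output_file, csv_file)  # hangs on the 2nd or 3rd file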

Environment: Python 3.5.2, pyarrow 0.5.0
Reporter: Keith Curtis
Assignee: Wes McKinney / @wesm

Original Issue Attachments:

Note: This issue was originally created as ARROW-1311. Please see the migration documentation for further details.

Wes McKinney / @wesm:
hi [~K94] – on what platform, and how did you install? Seems this may be ARROW-1282.

@xhochy I think we need to get patched builds out ASAP. Let me know how you want to proceed

Uwe Korn / @xhochy:
@wesm We should simply disable jemalloc by default until these problems have been resolved. I will try to reproduce locally and then talk to the jemalloc people to get it fixed upstream.
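
For context, a possible user-side workaround would be to route allocations away from jemalloc; this is a minimal sketch, assuming the memory-pool selection API below is available in the installed build (it exists in later pyarrow releases, but may not be exposed in 0.5.0):

# Sketch of a workaround, not from this thread: use the system allocator
# instead of jemalloc for Arrow allocations. Assumes pyarrow.set_memory_pool()
# and pyarrow.system_memory_pool() are available in the installed version.
import pyarrow

pyarrow.set_memory_pool(pyarrow.system_memory_pool())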

Wes McKinney / @wesm:
We could release patched builds on PyPI, but there is also the performance regression ARROW-1290. I may update 0.5.0 on conda-forge to include this patch and disable jemalloc for now.

Keith Curtis:
Stack trace from gdb when Python appeared to be hung.

Keith Curtis:
I re-ran my code with a revised function; I added a line that updates a column, which seems to matter.

def to_parquet(output_file, csv_file):
    df = pd.read_csv(csv_file)
    df['gecco_variant'] = [v.lstrip('0') for v in df['gecco_variant']]
    table = pyarrow.Table.from_pandas(df)
    pq.write_table(table, output_file)

When Python seemed hung (no progress after 3 minutes), I captured a stack trace with gdb and attached the file.

I'm running on Ubuntu 14.04.3. I installed into a conda virtual environment using pip.

Wes McKinney / @wesm:
Thanks, indeed this is ARROW-1282. I'm in the process of updating 0.5.0 binaries to disable the jemalloc allocator.

If you are using pip, can you try pip install pyarrow==0.5.* which should pull the 0.5.0.post1 updated build? If you are using conda, it will take me a little while to update the binaries on conda-forge.
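
A quick way to confirm which build actually got installed (a minimal check; the post-release suffix should appear in the version string if the patched wheel was picked up):

# Print the installed pyarrow version; expect something like 0.5.0.post1 here
# if the updated wheel was installed.
import pyarrow

print(pyarrow.__version__)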

Wes McKinney / @wesm:
Same issue as ARROW-1282

Wes McKinney / @wesm:
Actually, I made a mistake in the build, and need to post another one, hang on for a few minutes.

Wes McKinney / @wesm:
Should be all set now with 0.5.0.post2

Keith Curtis:
Hi, I think I have the updated one:
$ pip install --upgrade pyarrow==0.5.*
Collecting pyarrow==0.5.*
Downloading pyarrow-0.5.0.post1-cp35-cp35m-manylinux1_x86_64.whl (8.9MB)
...

I re-ran my script, but python appeared to hang, and the stack trace looks similar:

#0 je_spin_adaptive (spin=) at include/jemalloc/internal/spin.h:40
#1 chunk_dss_max_update (new_addr=) at src/chunk_dss.c:83
#2 je_chunk_alloc_dss (tsdn=tsdn@entry=0x7f6d609ab620, arena=arena@entry=0x7f6ca8800140, new_addr=new_addr@entry=0x7f6c33000000, size=size@entry=8388608,
alignment=alignment@entry=2097152, zero=zero@entry=0x7fff45db9850, commit=commit@entry=0x7fff45db97a0) at src/chunk_dss.c:122
#3 0x00007f6ca92bb02f in chunk_alloc_core (dss_prec=dss_prec_secondary, commit=0x7fff45db97a0, zero=0x7fff45db9850, alignment=2097152, size=8388608, new_addr=0x7f6c33000000,
arena=0x7f6ca8800140, tsdn=0x7f6d609ab620) at src/chunk.c:357
#4 chunk_alloc_default_impl (commit=0x7fff45db97a0, zero=0x7fff45db9850, alignment=2097152, size=8388608, new_addr=0x7f6c33000000, arena=0x7f6ca8800140, tsdn=0x7f6d609ab620)
at src/chunk.c:430
#5 je_chunk_alloc_wrapper (tsdn=tsdn@entry=0x7f6d609ab620, arena=arena@entry=0x7f6ca8800140, chunk_hooks=chunk_hooks@entry=0x7fff45db97c0, new_addr=new_addr@entry=0x7f6c33000000,
size=size@entry=8388608, alignment=2097152, sn=sn@entry=0x7fff45db97b0, zero=zero@entry=0x7fff45db9850, commit=commit@entry=0x7fff45db97a0) at src/chunk.c:490
...

Keith Curtis:
Ok, I'll re-try with post2

Keith Curtis:
I re-ran my script with pyarrow-0.5.0.post2; that seemed to fix it, and my script ran smoothly, converting 22 csv files to parquet format. Thanks!

Wes McKinney / @wesm:
Cool, thank you! And very sorry about the trouble. We would have learned about these problems with jemalloc earlier, but we only made it the default allocator in 0.5.0, so it's good to know about them so we can work with the jemalloc developers to figure out what's wrong.

Uwe Korn / @xhochy:
Two things I see in this BT:

  • The size change is quite small: size=2097216, oldsize=2097152
  • We're in the area of 2 GiB of allocated memory and want to expand the region by a page or so, not do a full re-allocation.

[~K94] It would be nice if you could try to run into the problematic situation again and get a more detailed backtrace with thread apply all bt full instead of bt in gdb. This would help me better understand the problem. Sadly, I have not yet been able to reproduce this locally.
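
For reference, capturing that kind of backtrace from the hung process looks roughly like this, where <pid> is the process ID of the stuck Python interpreter:

$ gdb -p <pid>
(gdb) thread apply all bt full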

Keith Curtis:
Okay, to get the new backtrace (attached), I had to install the 0.5.0 version from conda:
pyarrow: 0.5.0-np112py35_0 conda-forge

I see there are a lot of threads in there (64?), more than I expected. I ran it from the IPython qtconsole; maybe that has something to do with it. Hope that helps.

Todd Farmer / @toddfarmer:
Transitioning issue from Resolved to Closed based on resolution field value.
