[Python] Regression: segfault when reading hive table with v0.14 #16812

asfimport · 2019-07-17T08:50:15Z

I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow installed in a conda env.

The data I'm reading is a Hive(-registered), partitioned table written as Parquet. With v0.13, reading this table does not cause any issues.

The code that worked before and now crashes with v0.14 is simply:

import pyarrow.parquet as pq
pq.ParquetDataset('hdfs:///data/raw/source/table').read()

Since it completely crashes my notebook (and in the REPL the process ends with "Killed"), I cannot report much more, but this is a pretty severe usability restriction. So far the only workaround is to pin pyarrow<0.14.

Reporter: H. Vetinari

Related issues:

[C++/Python] Document how to provide information on segfaults (relates to)

Note: This issue was originally created as ARROW-5965. Please see the migration documentation for further details.

asfimport · 2019-07-17T14:42:45Z

Neal Richardson / @nealrichardson:
Thanks for the report. A few questions:

Is this reproducible if you try again with the same file? (I wonder if "Killed" means OOM and not segfault)
Could you provide a (preferably as small as possible) Parquet file that triggers this behavior? I think we'll need that in order to identify and fix any issues.
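One way to tell the two cases apart without a debugger is to run the failing script in a child process and inspect how it died. The following is a minimal sketch, assuming a POSIX system; the helper names `describe_exit` and `run_and_describe` are made up for illustration and are not part of pyarrow.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a human-readable cause."""
    if returncode == 0:
        return "exited normally"
    if returncode > 0:
        return f"exited with status {returncode}"
    # On POSIX, subprocess reports death-by-signal as the negated signal number.
    try:
        sig = signal.Signals(-returncode)
    except ValueError:
        return f"killed by signal {-returncode}"
    if sig == signal.SIGKILL:
        # SIGKILL is what the kernel OOM killer sends, but other senders exist too.
        return "killed by SIGKILL (consistent with the OOM killer, not proof of it)"
    if sig == signal.SIGSEGV:
        return "killed by SIGSEGV (segmentation fault)"
    return f"killed by {sig.name}"


def run_and_describe(script_path: str) -> str:
    """Run a Python script in a child process and describe how it ended."""
    proc = subprocess.run([sys.executable, script_path])
    return describe_exit(proc.returncode)
```

Calling `run_and_describe("fail.py")` on a script containing the two-line reproducer would then report SIGKILL or SIGSEGV. A SIGKILL result only suggests memory pressure; the kernel log would be needed to confirm that the OOM killer was involved.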

asfimport · 2019-07-17T14:57:05Z

H. Vetinari:
Hey Neal,

I tried a couple of times before filing the report, and all (~5) invocations on 0.14 crashed, and all invocations on 0.13 worked. The machine itself has lots of memory, so I don't think it's that. Not sure I'll be able to pare this down to a minimal reproducing parquet file. I'll try.
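To back up the memory question with a number, the peak resident memory of the working 0.13 read could be measured and compared with what the machine has free. The following is a rough sketch using only the standard library, assuming a POSIX system; `peak_child_rss` is a hypothetical helper name.

```python
import resource
import subprocess
import sys


def peak_child_rss(argv):
    """Run argv to completion and return (returncode, peak RSS of children).

    ru_maxrss is reported in kibibytes on Linux and in bytes on macOS, and
    RUSAGE_CHILDREN covers the largest child this process has waited for so
    far, so call this from a fresh interpreter for a clean reading.
    """
    proc = subprocess.run(argv)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return proc.returncode, usage.ru_maxrss
```

For example, `peak_child_rss([sys.executable, "fail.py"])` in the 0.13 environment would show how much memory a successful read needs, which gives a baseline for judging whether the 0.14 run could plausibly have exhausted memory.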

asfimport · 2019-07-17T15:05:47Z

Wes McKinney / @wesm:
A gdb backtrace would help us a lot. Do you know how to get one?

asfimport · 2019-07-17T19:08:19Z

H. Vetinari:
@wesm
I would like to provide one, but I can only install packages through conda (which is allowed through the firewall).
Unfortunately,
# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

- pip -> python[version='>=3.7,<3.8.0a0']

which, I believe, is because gdb has not yet been built for Python 3.7. (Just as I was preparing this message, though, I triggered a rerender there, which led to some further action and the first passing 3.7 build; it is not yet merged because the 2.7 build is failing.)

In the meantime I tried downgrading my whole environment to 3.6, where the program also crashes or hangs on v0.14. However, I haven't yet been able to get a gdb output. Might need some more reading of the GDB manual...

asfimport · 2019-07-17T19:11:26Z

Wes McKinney / @wesm:
Note that I linked this with ARROW-2652, since many users aren't familiar with producing gdb backtraces for crashes in Python programs.

asfimport · 2019-07-18T06:32:08Z

H. Vetinari:
@wesm
Thanks for the tips. Unfortunately, I can't follow that example because the code does not generate a core-dump but only prints "Killed". I found some ways to run it in gdb that should work (best as I can tell), like gdb -ex r --args python fail.py or interactively:
gdb python
(gdb) run fail.py

but I always get:
[...]
warning: Could not trace the inferior process
Error:
warning: ptrace: Operation not permitted
During startup program exited with code 127.

Not sure if that's a mistake on my side or something in the setup/interplay of conda-gdb.
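When ptrace is blocked and gdb cannot attach, Python's built-in `faulthandler` module is a fallback that needs no extra packages. It installs a handler for fatal signals such as SIGSEGV and prints the Python-level traceback to stderr before the process dies. The following is a minimal sketch; `run_with_faulthandler` is a hypothetical helper name. It will not show C++ frames, and it cannot report anything when the process is ended by SIGKILL, since that signal cannot be caught.

```python
import subprocess
import sys


def run_with_faulthandler(argv_tail):
    """Run a Python child with faulthandler enabled; return (returncode, stderr).

    argv_tail is whatever would normally follow `python` on the command line,
    for example ["fail.py"] or ["-c", "some code"].
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *argv_tail],
        stderr=subprocess.PIPE,
        text=True,
    )
    return proc.returncode, proc.stderr
```

With the reproducer saved as `fail.py`, `run_with_faulthandler(["fail.py"])` would return the traceback text if the crash is a real segfault. An empty stderr together with a return code of -9 would instead point back toward the process being killed from outside.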

asfimport · 2019-08-19T19:41:37Z

Wes McKinney / @wesm:
I'm guessing this is a duplicate of the memory issue from ARROW-6060. If you obtain a repro or additional information suggesting it's not a memory problem, please reopen.

asfimport closed this as completed Aug 19, 2019

asfimport added this to the 0.15.0 milestone Jan 11, 2023

asfimport mentioned this issue Jan 11, 2023

[C++/Python] Document how to provide information on segfaults #19047

Open
