Segmentation fault while loading a json file #11044

Closed
hrushikesh198 opened this issue Sep 1, 2021 · 5 comments

Comments

@hrushikesh198 commented Sep 1, 2021

Hi,

I am trying to load a ~300 MB, 1.4M-line file in JSONL format (one JSON object per line). It generates a segmentation fault. I saw past issues mentioning this, and fixes were merged for them, but I still see the issue with pyarrow 3.0.0/4.0.0/5.0.0.
When I try to load a subset of the file it works fine; with the complete data it fails.

Thank you for any help you can offer.

Here is a sample of the data (I cannot share the full file since it is private to my company):

$ head -n3 data.json
{"item_id": "100000663", "product_type": "Facial Masks", "brand": "Andalou Naturals", "color": "Other", "gender": "Unisex", "product_name": "Andalou Naturals Face Mask, Instant Luminous, 0.28 Oz"}
{"item_id": "100001838", "product_type": "Dining Tables", "brand": "Liberty Furniture", "color": "Gray", "gender": "", "product_name": "Liberty Furniture Industries Summer House Rectangular Dining Table"}
{"item_id": "100002700", "product_type": "Facial Treatments", "brand": "SkinCeuticals", "color": "", "gender": "Male", "product_name": "SkinCeuticals B3 Metacell Renewal 1.7 Oz"}

I am using Python 3.7.10 and pyarrow 4.0.1, installed with conda (4.9.2) on a Debian 10 machine.

Here is the tiny Python script:

import faulthandler

from pyarrow import json as paj

faulthandler.enable()  # print stack trace for seg faults

if __name__ == '__main__':
    f1 = "data.json"
    paj.read_json(f1)  # fails with seg fault

Here is the stack trace from gdb:

(gdb) run segf.py
Starting program: segf.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff47ff700 (LWP 17933)]
[New Thread 0x7ffff0b77700 (LWP 17934)]
[New Thread 0x7fffdb3fb700 (LWP 17935)]
[New Thread 0x7fffda2ff700 (LWP 17936)]
[New Thread 0x7fffd9afe700 (LWP 17937)]
[New Thread 0x7fffd8dff700 (LWP 17938)]
[New Thread 0x7fffcbfff700 (LWP 17939)]
[New Thread 0x7fffcb7fe700 (LWP 17940)]
[New Thread 0x7fffcabff700 (LWP 17941)]
[New Thread 0x7fffc9b7f700 (LWP 17942)]
[New Thread 0x7fffc8dff700 (LWP 17943)]
[New Thread 0x7fffafbff700 (LWP 17944)]
[New Thread 0x7fffae5ff700 (LWP 17945)]
[New Thread 0x7fffadbfe700 (LWP 17946)]
[New Thread 0x7fffa3fff700 (LWP 17947)]
[New Thread 0x7fffa37fe700 (LWP 17948)]
[New Thread 0x7fffa27fd700 (LWP 17949)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff68ca21a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
(gdb) backtrace
#0  0x00007ffff68ca21a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
#1  0x00007ffff70b84f2 in arrow::json::ChunkedListArrayBuilder::Insert(long, std::shared_ptr<arrow::Field> const&, std::shared_ptr<arrow::Array> const&) ()
   from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
#2  0x00007ffff70b6d86 in arrow::json::ChunkedStructArrayBuilder::Finish(std::shared_ptr<arrow::ChunkedArray>*) ()
   from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
#3  0x00007ffff70e9d13 in arrow::json::TableReaderImpl::Read() () from /opt/conda/lib/python3.7/site-packages/pyarrow/../../../libarrow.so.400
#4  0x00007ffff0b82eba in __pyx_pw_7pyarrow_5_json_1read_json(_object*, _object*, _object*) () from /opt/conda/lib/python3.7/site-packages/pyarrow/_json.cpython-37m-x86_64-linux-gnu.so
#5  0x00005555556e2427 in _PyMethodDef_RawFastCallKeywords (method=<optimized out>, self=0x0, args=0x7ffff79a85c8, nargs=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:693
#6  0x00005555556e3ad8 in _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, args=0x7ffff79a85c8, func=0x7ffff32f9af0)
    at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Objects/call.c:723
#7  call_function (pp_stack=0x7fffffffd8e0, oparg=<optimized out>, kwnames=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:4568
#8  0x000055555570e74a in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3093
#9  0x0000555555651af2 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff79a8450) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3930
#10 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>,
    kwargs=<optimized out>, kwcount=<optimized out>, kwstep=<optimized out>, defs=<optimized out>, defcount=<optimized out>, kwdefs=<optimized out>, closure=<optimized out>,
    name=<optimized out>, qualname=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3930
#11 0x0000555555652d09 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>,
    kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:3959
#12 0x000055555572d8ab in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/ceval.c:524
#13 0x0000555555791f53 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff7a11eb0, locals=0x7ffff7a11eb0, flags=<optimized out>, arena=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/pythonrun.c:1035
#14 0x000055555579bfd7 in PyRun_FileExFlags (fp=0x555555925d30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7a11eb0, locals=0x7ffff7a11eb0, closeit=1,
    flags=0x7fffffffdbd0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/pythonrun.c:988
#15 0x000055555579c1ac in PyRun_SimpleFileExFlags (fp=0x555555925d30, filename=<optimized out>, closeit=1, flags=0x7fffffffdbd0)
    at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Python/pythonrun.c:429
#16 0x000055555579c709 in pymain_run_file (p_cf=0x7fffffffdbd0, filename=<optimized out>, fp=0x555555925d30)
    at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:456
#17 pymain_run_filename (cf=0x7fffffffdbd0, pymain=0x7fffffffdce0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:1646
#18 pymain_run_python (pymain=0x7fffffffdce0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:2907
#19 pymain_main (pymain=0x7fffffffdce0) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:3068
#20 0x000055555579c85c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Modules/main.c:3103
#21 0x00007ffff7c6b09b in __libc_start_main (main=0x555555631100 <main>, argc=2, argv=0x7fffffffde38, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffde28) at ../csu/libc-start.c:308
#22 0x0000555555719901 in _start () at /home/conda/feedstock_root/build_artifacts/python_1613748395163/work/Parser/parser.c:325
(gdb)

Here are the memory and ulimit settings:

(base) xx@xx:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           83Gi       967Mi        77Gi       8.0Mi       4.7Gi        81Gi
Swap:            0B          0B          0B
(base) xx@xx:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 342094
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 342094
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
@hrushikesh198 (Author)

After digging through 1.4M lines of data with various debugging scripts, I found there was a JSON row with "null" as a key and ["xyz"] as the value. Removing that row allowed pyarrow to load the JSON file successfully.
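
One way such digging can be scripted (a sketch, not the author's actual script; crashes() and find_bad_prefix() are hypothetical helpers): bisect over file prefixes, loading each candidate in a child process so that a segfault kills only the child.

import subprocess
import sys

def crashes(path):
    # Negative return codes mean the child was killed by a signal
    # (e.g. -11 for SIGSEGV); ordinary parse errors exit with 1.
    code = f"from pyarrow import json as paj; paj.read_json({path!r})"
    return subprocess.run([sys.executable, "-c", code]).returncode < 0

def find_bad_prefix(lines, tmp_path="subset.json"):
    # Invariant: lines[:lo] loads fine, lines[:hi] crashes.
    lo, hi = 0, len(lines)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        with open(tmp_path, "w") as f:
            f.writelines(lines[:mid])
        if crashes(tmp_path):
            hi = mid
        else:
            lo = mid
    return hi  # smallest prefix length that still crashes

with open("data.json") as f:
    n = find_bad_prefix(f.readlines())
print("crash first appears once line", n, "is included")

(As the rest of the thread shows, the crash actually depends on chunk boundaries rather than a single bad row, so the bisection is not strictly monotone here; it still narrows the search quickly in practice.)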

@westonpace (Member)

This sounds like something that should, at least, generate a much better error message. Is there any chance you can use this info to create a small file that reproduces the error?

@amol- (Member) commented Sep 2, 2021

I tried a quick test and it seems to handle "null": ["xyz"] correctly:

import tempfile

with tempfile.NamedTemporaryFile(delete=False, mode="w+") as f:
    f.write('{"null": ["xyz"], "b": 2.0, "c": 1}\n')

import pyarrow as pa
import pyarrow.json

table = pa.json.read_json(f.name)
print(table.to_pydict())

Trying with null instead of "null" does lead to a proper error message:

pyarrow.lib.ArrowInvalid: JSON parse error: Missing a name for object member. in row 0

If you could provide further details to help reproduce the issue, that would be great.

@hrushikesh198 (Author) commented Sep 2, 2021

Thank you so much for looking into the problem.
I could not generate a 5-10 line file that raises the segfault, so I garbled the text from the smallest file in my project that raises a segfault when read with pyarrow.json.read_json(fname).

segfault_samples.zip

filename               numlines   contains "null":["46.0"]   segfault
largefile-nonull.jl    4899       No                         No
largefile-withnull.jl  4900       Yes                        Yes
minifile-withnull.jl   11         Yes                        No

@westonpace (Member)

Thank you very much for giving us a reproducible test case. I don't know that it would have been obvious at all without it.

https://issues.apache.org/jira/browse/ARROW-13871

The root cause was that the crash required a list-typed column (e.g. "null": ["46.0"]) that was entirely absent from the final chunk of the file.

It didn't appear in small files because they were read as a single chunk. It actually had nothing to do with the word "null" after all.
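
Here is a minimal sketch of that shape of input (an illustration, not from the thread; it uses pyarrow.json.ReadOptions(block_size=...) to force the reader to split the file into multiple chunks). On releases that include the ARROW-13871 fix this reads cleanly; affected versions could crash here:

import tempfile

import pyarrow.json as paj

# Early rows carry a list-typed column; the trailing rows omit it entirely,
# so with a small enough block size the final chunk never sees the column.
with tempfile.NamedTemporaryFile(delete=False, mode="w", suffix=".jsonl") as f:
    f.write('{"a": 1, "tags": ["x"]}\n' * 2000)  # column present
    f.write('{"a": 2}\n' * 2000)                 # column absent
    name = f.name

opts = paj.ReadOptions(block_size=4096)  # small blocks -> many chunks
table = paj.read_json(name, read_options=opts)
print(table.schema)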
