Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault when using viztracer to profile a numpy program #56

Closed
ogrisel opened this issue Apr 29, 2022 · 6 comments
Closed

Segfault when using viztracer to profile a numpy program #56

ogrisel opened this issue Apr 29, 2022 · 6 comments

Comments

@ogrisel
Copy link

ogrisel commented Apr 29, 2022

Minimal reproducer:

$ pip install viztracer numpy
$ python -m viztracer -c "import numpy"
Saving trace data, this could take a whileSegmentation fault (core dumped)

Here is the gdb backtrace:

Saving trace data, this could take a while                                      
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
77	../sysdeps/x86_64/multiarch/strlen-evex.S: No such file or directory.
(gdb) bt
#0  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
#1  0x00007ffff7d23ac9 in __GI__IO_fputs (str=0x0, fp=fp@entry=0x5555559b03c0) at iofputs.c:33
#2  0x00007ffff6ec6191 in fprintfeename (fptr=fptr@entry=0x5555559b03c0, node=node@entry=0x4be89f6bcf0) at src/viztracer/modules/eventnode.c:163
#3  0x00007ffff6ec7b32 in snaptrace_dump (self=0x4be7d589e20, args=<optimized out>) at src/viztracer/modules/snaptrace.c:925
#4  0x00005555557bc6ef in method_vectorcall_VARARGS (func=<method_descriptor at remote 0x4be7d42c190>, args=0x7fffffffd3e8, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/descrobject.c:316
#5  0x000055555580de52 in _PyEval_Fast (ts=0x5555559b2570, initial_acc=Register(as_int64 = 0), initial_pc=0x5555558d7d2f <func_vector_call> "\t") at Python/ceval.c:737
#6  0x00005555556a8d1f in _PyEval_Eval (pc=<optimized out>, acc=..., tstate=0x5555559b2570) at Python/ceval_meta.c:2831
#7  _PyFunction_Vectorcall (func=<function at remote 0x4be7d7e4390>, stack=0x7fffffffd4a0, nargsf=<optimized out>, kwnames=<optimized out>) at Python/ceval_meta.c:3228
#8  0x00005555555c7d78 in _PyObject_FastCallDictTstate (kwargs=0x0, nargsf=1, args=0x7fffffffd4a0, callable=<function at remote 0x4be7d7e4390>, tstate=0x5555559b2570) at Objects/call.c:118
#9  _PyObject_Call_Prepend (tstate=tstate@entry=0x5555559b2570, callable=callable@entry=<function at remote 0x4be7d7e4390>, 
    obj=obj@entry=<Finalize(_weakref=<weakref at remote 0x4be7bcb7e50>, _callback=<method at remote 0x4be7bcb7db0>, _args=(), _kwargs={}, _key=(-1, 0), _pid=379537) at remote 0x4be7bc778d0>, 
    args=args@entry=(), kwargs=kwargs@entry=0x0) at Objects/call.c:387
#10 0x0000555555631bec in slot_tp_call (
    self=<Finalize(_weakref=<weakref at remote 0x4be7bcb7e50>, _callback=<method at remote 0x4be7bcb7db0>, _args=(), _kwargs={}, _key=(-1, 0), _pid=379537) at remote 0x4be7bc778d0>, args=(), kwds=0x0)
    at Objects/typeobject.c:6819
#11 0x00005555555c5a6d in _PyObject_MakeTpCall (tstate=0x5555559b2570, 
    callable=<Finalize(_weakref=<weakref at remote 0x4be7bcb7e50>, _callback=<method at remote 0x4be7bcb7db0>, _args=(), _kwargs={}, _key=(-1, 0), _pid=379537) at remote 0x4be7bc778d0>, args=<optimized out>, 
    nargs=0, keywords=0x0) at Objects/call.c:191
#12 0x00005555556a4beb in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fffffffd5b8, 
    callable=<Finalize(_weakref=<weakref at remote 0x4be7bcb7e50>, _callback=<method at remote 0x4be7bcb7db0>, _args=(), _kwargs={}, _key=(-1, 0), _pid=379537) at remote 0x4be7bc778d0>, tstate=0x5555559b2570)
    at ./Include/cpython/abstract.h:116
#13 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fffffffd5b8, 
    callable=<Finalize(_weakref=<weakref at remote 0x4be7bcb7e50>, _callback=<method at remote 0x4be7bcb7db0>, _args=(), _kwargs={}, _key=(-1, 0), _pid=379537) at remote 0x4be7bc778d0>, tstate=0x5555559b2570)
    at ./Include/cpython/abstract.h:103
#14 vm_call_function (ts=0x5555559b2570, acc=...) at Python/ceval_meta.c:1341
#15 0x00005555558111ab in _PyEval_Fast (ts=0x5555559b2570, initial_acc=Register(as_int64 = 0), initial_pc=0x4be7d7a69a1 "F\f") at Python/ceval.c:982
#16 0x00005555556a8d1f in _PyEval_Eval (pc=<optimized out>, acc=..., tstate=0x5555559b2570) at Python/ceval_meta.c:2831
#17 _PyFunction_Vectorcall (func=<function at remote 0x4be7d7e4f70>, stack=0x5555559542f0 <_Py_EmptyTupleStruct+48>, nargsf=<optimized out>, kwnames=<optimized out>) at Python/ceval_meta.c:3228
#18 0x000055555576a272 in atexit_callfuncs (module=<optimized out>) at ./Modules/atexitmodule.c:93
#19 0x000055555576a46d in atexit_run_exitfuncs (self=<optimized out>, unused=<optimized out>) at ./Modules/atexitmodule.c:192
#20 0x0000555555810c89 in _PyEval_Fast (ts=0x5555559b2570, initial_acc=Register(as_int64 = 0), initial_pc=0x5555558d9c71 <cfunc_header_noargs> "\n") at Python/ceval.c:769
#21 0x00005555556a7a11 in _PyEval_Eval (pc=<optimized out>, acc=..., tstate=0x5555559b2570) at Python/ceval_meta.c:2831
[...]

Not sure if this is a bug in viztracer or in nogil CPython.

@gaogaotiantian
Copy link

gaogaotiantian commented Apr 29, 2022

I did not have the chance to build the nogil python and test this locally, but it seems like it got the segfault when trying to print node->fee->ml_name. I would guess it's because node->fee->ml_name is no long a valid string and puts failed there.

viztracer does not copy the full name string, instead it stores the pointer to that string. To make this valid, viztracer assumes this string will not be freed or moved before it starts dumping data. The pointer is from cfunc->m_ml->ml_name. Maybe this is not a valid assumption by official definition? I did not think too much into it when I wrote it. That worked for the official cpython, but maybe something was changed in nogil? I think normally PyMethodDef was defined statically and should not be dynamically freed.

I could be completely wrong with my "guess" though. It would be nice to check if node->fee->ml_name was valid before the segfault.

@colesbury
Copy link
Owner

Thanks for the bug report @ogrisel and suggestion @gaogaotiantian . I'm able to reproduce the issue. The immediate cause is that viztracer is getting a PyMethodDescrObject, but it expects a PyCFunctionObject. The underlying cause is probably a bug in the implementation of function tracing in the "nogil" fork, but I'm not 100% sure yet.

@gaogaotiantian
Copy link

gaogaotiantian commented Apr 29, 2022

I did not find the hook for PyEval_SetProfile in nogil, but in official cpython, there's https://github.com/python/cpython/blob/f6656163de483003697d510031827b7512056d55/Python/ceval.c#L7291 to convert the descriptor to PyCFunction. Maybe that's the piece missing?

(edit) I saw the corresponding part in nogil and obviously similar code is in there. I don't have any clue yet for what happened. If you need some help from viztracer side I'd love to do so.

@colesbury
Copy link
Owner

@gaogaotiantian Yes, that was basically the issue. The corresponding version of the code in the "nogil" fork didn't handle when "self" is passed through varargs. I have a fix now here:

851b1aa

The code python -m viztracer -c "import numpy" now works with this change, but I want to test it a bit more before I merge it into the main "nogil" branch.

@ogrisel
Copy link
Author

ogrisel commented May 2, 2022

I confirm that 851b1aa fixes the original problem on my end.

I still have a segfault in another, more complex use case. I will open a dedicated issue.

Edit: issue opened under #58.

@colesbury
Copy link
Owner

This is fixed now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants