Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes profiled program is left suspended #390

Closed
richvdh opened this issue May 7, 2021 · 12 comments
Closed

Sometimes profiled program is left suspended #390

richvdh opened this issue May 7, 2021 · 12 comments

Comments

@richvdh
Copy link

richvdh commented May 7, 2021

This is intermittent, but: sometimes, once a py-spy record -d <duration> completes, the program being profiled is left in a suspended state.

for example:

richardv@corus:~$ sudo ./env/bin/py-spy record -f raw -p 20463 -d 5 -g
py-spy> Sampling process 100 times a second for 5 seconds. Press Control-C to exit.

py-spy> Wrote raw flamegraph data to '20463-2021-05-07T10:32:48Z.txt'. Samples: 412 Errors: 0
py-spy> You can use the flamegraph.pl script from https://github.com/brendangregg/flamegraph to generate a SVG
richardv@corus:~$ ps 20463
  PID TTY      STAT   TIME COMMAND
20463 ?        Tsl   43:28 /opt/sygnal/env/bin/python -m sygnal.sygnal
@richvdh
Copy link
Author

richvdh commented May 7, 2021

the suspended process can be resumed with kill -CONT <pid>.

I tried poking at it a bit with perf, for example:

sudo perf record -T -e syscalls:sys_enter_ptrace ./env/bin/py-spy record -f raw -p 20463 -d 5 -g 
sudo perf script < perf.data

On a good run (when the process is successfully resumed), this ends with:

          py-spy 31006 [001] 44492463.265317: syscalls:sys_enter_ptrace: request: 0x00000011, pid: 0x00004fef, addr: 0x00000000, data: 0x00000000
          py-spy 31006 [001] 44492463.265327: syscalls:sys_enter_ptrace: request: 0x00000011, pid: 0x00004ff2, addr: 0x00000000, data: 0x00000000
          py-spy 31006 [001] 44492463.265332: syscalls:sys_enter_ptrace: request: 0x00000011, pid: 0x00004ff3, addr: 0x00000000, data: 0x00000000
          py-spy 31006 [001] 44492463.265338: syscalls:sys_enter_ptrace: request: 0x00000011, pid: 0x00004ff4, addr: 0x00000000, data: 0x00000000
          py-spy 31006 [001] 44492463.265342: syscalls:sys_enter_ptrace: request: 0x00000011, pid: 0x00004ff5, addr: 0x00000000, data: 0x00000000
...

0x00000011 is PTRACE_DETACH, so this shows py-spy correctly detaching from the profiled process.

Compare with a run where the process is left suspended, where the perf trace ends with:

          py-spy 31253 [003] 44492574.544683: syscalls:sys_enter_ptrace: request: 0x00000010, pid: 0x00004fef, addr: 0x00000000, data: 0x00000000
          py-spy 31253 [003] 44492574.544711: syscalls:sys_enter_ptrace: request: 0x00000010, pid: 0x00004ff2, addr: 0x00000000, data: 0x00000000
          py-spy 31253 [003] 44492574.544756: syscalls:sys_enter_ptrace: request: 0x00000010, pid: 0x00004ff3, addr: 0x00000000, data: 0x00000000

0x00000010 is PTRACE_ATTACH, so the last thing py-spy does is attach to the traced process.

@richvdh
Copy link
Author

richvdh commented May 7, 2021

I was only able to reproduce this with -f raw. My suspicion is that the svg generation leaves enough time for the DETACH to take place before py-spy exits.

@Jongy
Copy link
Contributor

Jongy commented May 7, 2021

Which py-spy version are you running?

@benfred
Copy link
Owner

benfred commented May 8, 2021

thanks for the bug report! I've managed to reproduce on the latest version of py-spy. It also only happens for me using -raw format.

(@Jongy it looks like the fix you put in for ptrace attach in py-spy v0.3.4 is unrelated to this issue benfred/remoteprocess#7 - I think this bug is to do with thread cleanup on the sampling thread).

@benfred
Copy link
Owner

benfred commented May 8, 2021

I have a first draft attempt at fixing here #391 - @richvdh can you try this out to see if this fixes? (once the build has finished you can download the pre-built wheels from https://github.com/benfred/py-spy/actions/runs/822517355 under 'artifacts' - which will save you from having to compile from source).

I still have to add logic for the subprocess sampling before merging this - but will get that in a couple days

@Jongy
Copy link
Contributor

Jongy commented May 9, 2021

(@Jongy it looks like the fix you put in for ptrace attach in py-spy v0.3.4 is unrelated to this issue benfred/remoteprocess#7 - I think this bug is to do with thread cleanup on the sampling thread).

It's indeed what I thought initially.

But I was under the impression that upon exit, the tracee gets released automatically. It can be reproduced with a simple gdb -p ... on some PID, which will now stop; then pkill -9 $(pgrep gdb) and the stopped PID continues.

@richvdh
Copy link
Author

richvdh commented May 11, 2021

I have a first draft attempt at fixing here #391 - @richvdh can you try this out to see if this fixes? (once the build has finished you can download the pre-built wheels from https://github.com/benfred/py-spy/actions/runs/822517355 under 'artifacts' - which will save you from having to compile from source).

yup, that seems to work fine.

@benfred
Copy link
Owner

benfred commented May 16, 2021

Thanks! Fix is in py-spy 0.3.7 (which will be pushed out to pypi/crates.io automatically by CI in the next 30 minutes or so).

@shyn
Copy link

shyn commented Nov 24, 2022

Encountered this when spying a pytorch program.
py-spy 0.3.14 with sudo py-spy -p <pid>

The kill -CONT <pid> did save my ass then.

@benfred
Copy link
Owner

benfred commented Nov 24, 2022

@shyn I'm sorry to hear that =(

Is there any other information you can share for tracking this down ? I haven't seen that myself since the fix went in

@shyn
Copy link

shyn commented Nov 25, 2022

@shyn I'm sorry to hear that =(

Is there any other information you can share for tracking this down ? I haven't seen that myself since the fix went in

Some environments:
pytorch1.8.2-cuda11.1
python -VV shows Python 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]

When the process is suspended, py-spy top showed nothing. But with --nonblocking I saw the code stuck in dataloader collate stuff.

Will try to reproduce this when free ;)

@shyn
Copy link

shyn commented Nov 25, 2022

Easy to reproduce. Just run py-spy top <pid> -g. The --gil option will suspend the process. Don't know if it's designed to. @benfred

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants