Signal handling may be delayed, which leads to dropped samples #5
Comments
I've implemented a workaround in my fork that synthesizes fake stack samples for all the missed sampling intervals. Now all I need to do is clean it up and create a fresh PR, without the extra debugging gunk. I seem to be taking unreasonable amounts of time doing this, because I'm busy with other things. :/
Thank you for investigating this and the detailed explanation. Can't wait to see the PR!
Any progress on this one? Is this issue still valid, or do you use other tools today?
I'm sorry, I drifted away to other projects without finding the time to clean this up. :(
@mgedmin I once wrote a similar tool called live-trace (https://github.com/guettli/live-trace) which dumps the current stacktrace every N milliseconds. But it has the same fundamental problem that you noted: if the main thread is blocked in some C code, signal handling gets delayed for an unbounded time. I tried to find a tool for profiling production environments with low impact, but up to now I could not find a solution. Do you have a hint?
Just for the record, I asked here to find a solution: https://stackoverflow.com/questions/49030629/statistical-profiling-in-python
I've previously mentioned this issue in the comments of PR #1, but now that I understand what the problem is, I think it's worth creating a separate issue, not tied to any particular PR.
djdt-flamegraph uses interval timers that periodically generate signals, and registers a signal handler to sample the current stack frame. Now, the way CPython implements signal handlers is that they set an internal flag, which is checked the next time the CPython interpreter enters the main eval loop.
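For illustration, here is a minimal sketch of this kind of sampler. It is not djdt-flamegraph's actual code: the choice of `SIGALRM` with `ITIMER_REAL`, the 1 ms interval, and the names `sample_stack`, `start_sampling`, `stop_sampling` and `stack_counts` are all assumptions made for the example. It is Unix-only and has to run in the main thread.

```python
import collections
import signal
import traceback

INTERVAL = 0.001  # assumed sampling interval in seconds (1 ms)

# Maps a stack (tuple of "function (file:line)" strings) to its sample count.
stack_counts = collections.Counter()


def sample_stack(signum, frame):
    # CPython calls this Python-level handler only once the interpreter
    # gets back to executing bytecode, not at the moment the OS signal
    # arrives. `frame` is the Python frame that was current at that point.
    stack = tuple(
        f"{entry.name} ({entry.filename}:{entry.lineno})"
        for entry in traceback.extract_stack(frame)
    )
    stack_counts[stack] += 1


def start_sampling():
    # Install the handler first, then arm a repeating wall-clock timer.
    signal.signal(signal.SIGALRM, sample_stack)
    signal.setitimer(signal.ITIMER_REAL, INTERVAL, INTERVAL)


def stop_sampling():
    # Disarm the timer before restoring the default handler.
    signal.setitimer(signal.ITIMER_REAL, 0, 0)
    signal.signal(signal.SIGALRM, signal.SIG_DFL)
```

Calling `start_sampling()` before the code under test and `stop_sampling()` after it leaves the collected stacks in `stack_counts`. As long as the main thread keeps executing Python bytecode, the handler runs roughly once per interval.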
A consequence of this is that if the main thread is blocked in some C code, signal handling gets delayed for an unbounded time, and you're getting a skewed profile picture because you're missing a significant number of samples. Example: executing an SQL query via psycopg2 may delay stack sampling for hundreds or thousands of milliseconds.
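The effect can be reproduced without a database. The sketch below is only an approximation of the psycopg2 case: it uses `hashlib.pbkdf2_hmac` (assuming the usual OpenSSL-backed build) as a stand-in for "the main thread is busy in C code", and the helper names `count_samples_during` and `blocked_in_c`, the 1 ms interval and the iteration count are made up for the example. It compares how many times the handler actually ran against how many intervals elapsed.

```python
import hashlib
import signal
import time

INTERVAL = 0.001  # assumed sampling interval in seconds (1 ms)


def count_samples_during(blocking_call):
    """Run blocking_call() under a repeating timer.

    Returns (delivered, expected): how many times the Python-level
    handler ran, and how many intervals elapsed during the call.
    """
    sample_times = []

    def handler(signum, frame):
        sample_times.append(time.perf_counter())

    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, INTERVAL, INTERVAL)
    start = time.perf_counter()
    try:
        blocking_call()
    finally:
        elapsed = time.perf_counter() - start
        # Disarm the timer, then put back whatever handler was there before.
        signal.setitimer(signal.ITIMER_REAL, 0, 0)
        if previous is None:
            previous = signal.SIG_DFL
        signal.signal(signal.SIGALRM, previous)
    return len(sample_times), int(elapsed / INTERVAL)


def blocked_in_c():
    # One long C call: the interpreter executes no bytecode until it
    # returns, so pending signal handlers cannot run in the meantime.
    hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 500_000)


if __name__ == "__main__":
    delivered, expected = count_samples_during(blocked_in_c)
    print(f"handler ran {delivered} time(s); {expected} intervals elapsed")
```

The handler should run far fewer times than the number of elapsed intervals, because the timer signals that arrive during the C call collapse into a single pending notification that is handled only after the call returns. A sampler that records the time of each handler invocation could use the gap since the previous sample to estimate how many intervals were missed.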