-
Notifications
You must be signed in to change notification settings - Fork 816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Potential for time skew in JFR event time? #884
Comments
Thank you for raising the question. |
I started looking at the source code after noticing potential time skew in data. I visualized per-thread CPU samples taken at 977 Hz from a JVM with dozens of active threads and noticed that hot CPU threads were still reported as executing multiple sampling periods after GC pauses purportedly began. You can clearly see these GC pauses across all the threads (CPU samples disappear) but the threads all appear to enter and leave the GC pause at different times. When I saw the visualization - which I later confirmed by looking at the raw JFR events - I immediately thought that's too much jitter: it kinda looks like CPU sample times aren't being recorded properly. I'm receptive to other theories. But the approach of not reading the time as close to stack walk initiation didn't sit well with me, so I filed this issue. There are plenty of other potential theories, of course. I'll comment here if I learn anything more. |
Thank you for your input - it's useful. I think it's easy to adjust the code to get timestamps earlier - at least, before collecting stack traces. I can make a branch to see if it makes data more accurate. |
I've committed the fix to record timestamps early. |
Thank you for the code change! I've performed limited testing of a build from VCS and the recorded times appear to be a bit better. I looked at the commit and the new timing logic appears reasonable to me. But I haven't done very comprehensive testing, so there may still be problems with skew. But I don't suppose there's much we can do about it. |
I was looking at the source code of async-profiler and noticed that the TSC time is queried at JFR event record creation time, which occurs towards the end of the sampling signal handler.
The code leading to the writing of the JFR event entails locks, stack walking, calling into JVMTI internals, etc.
It seems to me there is a potential for time jitter and for the JFR event time to lag the actual sample time by a meaningful margin.
Is the approach of waiting to query the time appropriate or should the time be queried closer to interrupt / instruction pointer resolution?
The text was updated successfully, but these errors were encountered: