Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential for time skew in JFR event time? #884

Closed
indygreg opened this issue Feb 2, 2024 · 5 comments
Closed

Potential for time skew in JFR event time? #884

indygreg opened this issue Feb 2, 2024 · 5 comments

Comments

@indygreg
Copy link

indygreg commented Feb 2, 2024

I was looking at the source code of async-profiler and noticed that the TSC time is queried at JFR event record creation time, which occurs towards the end of the sampling signal handler.

The code leading to the writing of the JFR event entails locks, stack walking, calling into JVMTI internals, etc.

It seems to me there is a potential for time jitter and for the JFR event time to lag the actual sample time by a meaningful margin.

Is the approach of waiting to query the time appropriate or should the time be queried closer to interrupt / instruction pointer resolution?

@apangin
Copy link
Collaborator

apangin commented Feb 4, 2024

Thank you for raising the question.
Did you observe a real problem with the way how timestamps are recorded, or is it more like an assumption from studying the code?
I haven't encountered issues with timestamps before, but if there are some, I will be glad to know details.

@indygreg
Copy link
Author

indygreg commented Feb 4, 2024

I started looking at the source code after noticing potential time skew in data.

I visualized per-thread CPU samples taken at 977 Hz from a JVM with dozens of active threads and noticed that hot CPU threads were still reported as executing multiple sampling periods after GC pauses purportedly began. You can clearly see these GC pauses across all the threads (CPU samples disappear) but the threads all appear to enter and leave the GC pause at different times. When I saw the visualization - which I later confirmed by looking at the raw JFR events - I immediately thought that's too much jitter: it kinda looks like CPU sample times aren't being recorded properly.

I'm receptive to other theories. But the approach of not reading the time as close to stack walk initiation didn't sit well with me, so I filed this issue. There are plenty of other potential theories, of course. I'll comment here if I learn anything more.

@apangin
Copy link
Collaborator

apangin commented Feb 4, 2024

Thank you for your input - it's useful.

I think it's easy to adjust the code to get timestamps earlier - at least, before collecting stack traces. I can make a branch to see if it makes data more accurate.
Anyway, there are delay possibilities that async-profiler cannot directly affect, e.g., a latency between an event occurs and a signal is delivered.

apangin added a commit that referenced this issue Feb 6, 2024
@apangin
Copy link
Collaborator

apangin commented Feb 6, 2024

I've committed the fix to record timestamps early.
Please check if it improves your use case.

@indygreg
Copy link
Author

indygreg commented Mar 6, 2024

Thank you for the code change!

I've performed limited testing of a build from VCS and the recorded times appear to be a bit better. I looked at the commit and the new timing logic appears reasonable to me.

But I haven't done very comprehensive testing, so there may still be problems with skew. But I don't suppose there's much we can do about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants