Add Java-API as output option for wall-clock profiling #913
Thank you for the proposal. The idea generally makes total sense, but I have questions regarding the proposed design.
Why would a new output type depend on the profiling mode? I understand that for your use case, you are interested specifically in the wall-clock profiling, but if we are speaking about a new async-profiler feature supported in upstream, I think it can be generalized.
This is a good goal indeed. In my opinion, it is not a reason for a new output format, though. The JFR format is designed with continuous profiling in mind, and it is widely used for this purpose. If you have concerns about continuous profiling, I'd suggest extracting them into separate issues and solving them independently from the Java API work.
No matter how fast it is, a Java consumer is fundamentally unable to keep up with the flow of profiling events. For example, when a stop-the-world GC is running, the Java consumer thread is stopped and cannot process events at all. To avoid data loss, having a growable buffer is essential.
A common question about a polling-style API is what to do when an event is not available.
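One possible shape for such a non-blocking contract is sketched below, with a toy in-memory queue standing in for the real event source. All names here (ProfilerSample, SamplePoller, pollSample) are illustrative assumptions, not async-profiler's actual API:

```java
import java.util.ArrayDeque;

// Hypothetical sketch of a non-blocking polling surface; all names are
// illustrative, not async-profiler's actual API.
final class ProfilerSample {
    long timestampNanos;
    long threadId;
}

interface SamplePoller {
    // Copies the next pending sample into `out`. Returns false immediately
    // when no event is available; the caller decides whether to back off,
    // sleep, or retry.
    boolean pollSample(ProfilerSample out);
}

// Toy in-memory implementation to illustrate the contract.
final class QueuePoller implements SamplePoller {
    private final ArrayDeque<ProfilerSample> pending = new ArrayDeque<>();

    void publish(long timestampNanos, long threadId) {
        ProfilerSample s = new ProfilerSample();
        s.timestampNanos = timestampNanos;
        s.threadId = threadId;
        pending.add(s);
    }

    @Override
    public boolean pollSample(ProfilerSample out) {
        ProfilerSample next = pending.poll();
        if (next == null) {
            return false;  // nothing available: do not block
        }
        out.timestampNanos = next.timestampNanos;
        out.threadId = next.threadId;
        return true;
    }
}
```

The "copy into a caller-supplied object" style avoids allocating a fresh sample object per poll, which matters when polling at high frequency.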
Before working on a PR, I suggest preparing a design doc for discussion. Based on the comments above, the proposed approach does not necessarily fit well into the current architecture of the profiler.
A growing data structure is not a problem per se, as long as there is a way to flush or trim the structure periodically. This function gives an idea of what takes up the profiler's memory. Besides call traces, this includes various dictionaries: classes, methods, symbols. You'll likely need dictionaries anyway, since resolving each method one by one without caching is extremely inefficient.
Thank you for your feedback! Your remarks are understandable to me and make total sense.
I was trying to limit the scope of the feature so that the resulting PR does not get overly big. What about limiting the scope to only supporting events of type EXECUTION_SAMPLE?
Makes sense, let's keep the problem of the growing data structure and running the profiler continuously out of this issue.
Sorry, I was imprecise here. I mean a non-blocking, polling fashion.
By leaving aside the problem of continuous profiling we can simplify the approach by reusing your existing data structures.
When the output mode is set to java-api, the profiler's existing data structures are populated as usual, but no JFR file is written; the recorded events are instead exposed to the Java API. The polling from the Java API would then work directly against these structures. With this design, the memory occupied by the profiler stays bounded. Some clarifications for your additional remarks:
I would propose to add an API entry point for this. However, I think the details of the Java API surface are best discussed on a concrete prototype implementation.
With my revised approach, do you still feel the need for a separate doc before creating a prototype PR?
Thank you for the update. At first glance, the concept of BufferedEvent and ApiEventBuffer looks controversial. This seems to be a new structure for buffering events which requires writing plenty of C++ code, especially if we intend to support multiple event types. In the meantime, there is already a structure exactly for storing events: the JFR buffer. It already supports all types of events and serializes them in a compact format that can be easily processed both in C++ and Java. Currently, JFR buffers are flushed to a file, but it should be easy to store the data in memory instead. Or, even better, the existing recorder may be reused to write data to a pipe-like file descriptor, so that a Java thread may consume the opposite end of the pipe. As a bonus, the Java side will automatically benefit from all supported ways of doing I/O: blocking reads, polling, or async I/O with callbacks. This approach is also suitable for streaming events over the wire.

This is just the first idea that came to my mind; I haven't considered it thoroughly yet. The key point here is that a "proper" design should be 1) generic enough; 2) simple to implement with little change to the existing architecture; 3) introduce a minimum amount of new native APIs.
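To illustrate the consumer end of such a pipe, here is a minimal sketch. The framing is a simplified stand-in (a 4-byte big-endian size header before each record); the real JFR encoding differs (it uses variable-length integer sizes and typed payloads), so this only demonstrates the blocking-read consumption pattern, not the actual format:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a Java-side consumer reading size-prefixed records from a
// pipe-like stream. Framing is a simplified assumption: 4-byte big-endian
// length header followed by the record payload.
final class RecordReader {
    private final DataInputStream in;

    RecordReader(InputStream source) {
        this.in = new DataInputStream(source);
    }

    // Reads one complete record, blocking until it is fully available;
    // returns null on a clean end of stream.
    byte[] readRecord() throws IOException {
        int length;
        try {
            length = in.readInt();   // size header
        } catch (EOFException eof) {
            return null;             // producer closed the pipe between records
        }
        byte[] payload = new byte[length];
        in.readFully(payload);       // blocks until the whole record arrived
        return payload;
    }
}
```

Because readFully blocks until a whole record is present, the byte-oriented nature of the pipe is hidden from the caller, who only ever sees complete events.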
Sounds like a good idea to me, though I'm not sure if this actually lowers the amount of C++ code required. Please correct me here if I'm wrong; I'm definitely not an expert in the JFR file structure. I'd suggest that in this "pipe" mode, the flight recorder output just sends the events out without the metadata. Is the binary structure of the events stable, or will new fields only be added at the end? If not, we'll also need to somehow expose the structure via the Java API. I initially thought of using unix domain sockets in datagram mode and immediately writing events to them from the signal handler as they occur. Therefore, if we go the route of streaming binary JFR events, I think we should go with actual POSIX pipes, especially because their capacity can be configured.
To provide reasonably low latency, I think the buffers would need to be flushed frequently. Overall, I think this approach would be pretty close to #404, and I assume it would be sufficient for many people who requested this feature. However, I think it is also going to require quite some changes in the existing recorder.
This data structure was intended as a thin wrapper around the profiler's existing events.
To pick up the last point from my previous comment regarding the implementation complexity of buffering events: if required, this data structure can be made to support blocking reads as well.
Not really. There is no problem in sending events as they come: the metadata for parsing JFR records is written at the beginning of the chunk, and each record is prepended with a size header. Dictionaries (constant pools) required for the resolution of numeric references may interleave with the events, and there can be as many incremental constant pools in a chunk as you want.

Anyhow, the key question is where this low-latency requirement comes from. A 10ms latency seems unreasonable to me. Such latency cannot be guaranteed on the JVM level (considering JVM safepoint pauses), or even on the OS level. I'd say any design that does not tolerate an event processing latency on the order of seconds is doomed to failure. So, before implementing the proposed feature, I'd suggest jumping back to the original use case and starting the design from the consumer end.
(sorry, closed by mistake)
To add to the previous point, I foresee an inevitable need for post-processing of profiling data anyway (as opposed to on-the-go processing). As an example, I'm currently working on a feature that should reduce the overhead of wall-clock profiling by an order of magnitude, both in recording size and CPU consumption. As a result of this optimization, no wall-clock events will be emitted for idle threads until they wake up or a timeout elapses (where the timeout is measured in seconds). With this in mind, a system that intends to process execution samples should not rely on events arriving immediately for each observed thread, and no assumptions should be made about the appearance order of these events.
The original use case motivating this feature for me was the complexity currently introduced by the delayed processing of the profiling data for the "inferred-spans" feature:
The idea of my proposal of the polling API was to get rid of this delayed processing by consuming events right as they are recorded.
Of course, one would still need to take into account potential data races if multiple threads pull events at the same time, but that could be solved by the API consumer.
Great addition! I guess this very same mechanism could maybe also be used for virtual threads some time in the future? I would not expect the Java API to be stable: if the internal event structure of async-profiler changed, Java API consumers would likely have to change too, due to the changed semantics of events. IMO, API consumers would need to accept that and adapt to changes when upgrading.
Thanks for explaining this. I'm definitely not that familiar with the JFR format, and AFAIK it also isn't really publicly specified. Anyhow, I don't think I have the capabilities and capacity to implement JFR event streaming like you proposed, writing to an output with limited capacity (e.g. a pipe) directly from a signal handler, due to the potential problem of either truncation or blocking writes when the pipe gets full. Also, this wouldn't be very helpful for my original use case (inferred spans), because we'd still have to spill span activations/deactivations to disk and correlate them with the profiling data later.
As mentioned above, async-profiler already has a structure for storing events. It's OK to extend it to support more use cases. What I am concerned about is adding new native structures and APIs.
I don't see a fundamental difference between the queue you've proposed and a pipe in this sense. You have to bound the queue anyway to prevent unlimited growth, and you will therefore face the same problem of truncation or waiting. Meanwhile, the default pipe capacity on Linux is 64KB, which is enough to store 4000 events (that's a lot!).
I see your point and I understand it. If you don't see any other use cases than our admittedly rather special one, and feel that the cost of maintenance would be too high to justify this change, it makes sense for you to reject this proposal. If we (Elastic) need it that much, we could always maintain a fork and pay the maintenance cost ourselves. Anyway, if this changes over time and new use cases pop up, don't hesitate to reach out and I'll be happy to contribute.
The difference I'm seeing is the level on which the queue operates vs. a pipe: the queue operates on events, while pipes operate on bytes. So the difficulties I'm thinking of are, for example, that a reader of the pipe may observe partial events and has to reassemble them from the byte stream.
Nonetheless, thank you for taking the time to discuss this! I appreciate it.
@JonasKunz, a few years ago we played a little with a JFR output for async-profiler that made the metadata append-only. It could be the easiest way to solve your issue. In that case, you can make chunks small enough for your needs and transfer them via pipes as @apangin suggests. It still has some technical challenges, but they are very solvable.
This issue is a follow-up of this comment.

We would like to add a new output type for at least wall-clock profiling: java-api. When the output is set to this mode, no JFR files are generated and the CallTraceStorage will not be populated. Instead, the events will be put into a buffer from which they can be pulled via the Java API. I'd propose for simplicity to design the API in a polling fashion.

The goal is to allow the profiler to run continuously for an indefinite amount of time. The Java API consumer is responsible for pulling out the data fast enough so that no data loss occurs.

I'd implement this by having a signal-safe, preallocated pool of native pendants to the ProfilerSample introduced above. This can be implemented relatively easily using singly-linked lists for free and populated samples, where the list heads are managed using atomics. When the Java API pollSample is invoked, a populated sample is copied to the output and put back into the free list of the pool. If the profiler is not able to put a sample into the pool, because the Java API consumer didn't pull out the data in a timely fashion, an error counter shall be incremented instead.
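The pool idea above can be sketched as follows, in Java for brevity (the real implementation would be native C++, and all names here are hypothetical). Two singly-linked lists share preallocated nodes, and the list heads are swapped with compare-and-set. Note that this simplified pop is subject to the ABA problem under concurrent pops; a production lock-free version would need tagged pointers or a similar mitigation:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the preallocated sample pool: a free list and a
// "ready" list of populated samples, both managed via atomic head swaps.
final class SamplePool {
    static final class Node {
        long timestamp;   // sample payload, reduced to two fields for brevity
        long threadId;
        Node next;
    }

    private final AtomicReference<Node> freeHead = new AtomicReference<>();
    private final AtomicReference<Node> readyHead = new AtomicReference<>();
    private final AtomicLong dropped = new AtomicLong();

    SamplePool(int capacity) {
        for (int i = 0; i < capacity; i++) {
            push(freeHead, new Node());   // preallocate: no allocation later
        }
    }

    private static void push(AtomicReference<Node> head, Node n) {
        Node h;
        do {
            h = head.get();
            n.next = h;
        } while (!head.compareAndSet(h, n));
    }

    // Simplified lock-free pop; real code must guard against ABA.
    private static Node pop(AtomicReference<Node> head) {
        Node h;
        do {
            h = head.get();
            if (h == null) {
                return null;
            }
        } while (!head.compareAndSet(h, h.next));
        return h;
    }

    // Producer side (in the real profiler: called from the signal handler).
    void offer(long timestamp, long threadId) {
        Node n = pop(freeHead);
        if (n == null) {
            dropped.incrementAndGet();    // pool exhausted: count the loss
            return;
        }
        n.timestamp = timestamp;
        n.threadId = threadId;
        push(readyHead, n);
    }

    // Consumer side: copies one sample out and recycles the node.
    boolean pollSample(long[] out) {
        Node n = pop(readyHead);
        if (n == null) {
            return false;
        }
        out[0] = n.timestamp;
        out[1] = n.threadId;
        push(freeHead, n);
        return true;
    }

    long droppedSamples() {
        return dropped.get();
    }
}
```

One consequence of the linked-list design is that samples come out in LIFO order; a consumer that needs chronological order would have to re-sort by timestamp.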
For a start, the Java API would only support pulling out events of type EXECUTION_SAMPLE; everything else will simply be dropped. After reading through the sources a bit more, I think I should be able to implement this myself and open a PR.

Open questions:
What about the CallTraceStorage, which would grow over time and therefore prevent continuous profiling?
How can a TSC::ticks() value be converted to a System.nanotime() timestamp? We would need this to correctly order samples with application events (Spans in my particular use case).