Add Java-API as output option for wall-clock profiling #913

Open

JonasKunz opened this issue Apr 4, 2024 · 12 comments
@JonasKunz

This issue is a follow-up to this comment.

We would like to add a new output type for at least wall-clock profiling: java-api.
When the output is set to this mode, no JFR files are generated and the CallTraceStorage is not populated either.
Instead, the events are put into a buffer from which they can be pulled via the Java API.

For simplicity, I'd propose designing the API in a polling fashion:


class ProfilerSample {
    long timestamp;      // timestamp of the event, compatible with System.nanoTime()
    long nativeThreadId; // OS thread ID of the thread on which the sample was taken
    long[] stackTrace;   // stack trace in the form of jmethodIds
    int stackDepth;
}

interface ResolvableMethod {
    long getJmethodId();
    void setDeclaringClass(Class<?> clazz);
    void setMethodName(String name);
    void setSignature(String signature);
}

class AsyncProfiler {

    // returns true if a sample was available and moved to output, false otherwise
    boolean pollSample(ProfilerSample output) {...}

    // in order to make use of the samples, we need a way of mapping native thread IDs to threads
    long getNativeThreadId(Thread thread) {...}

    // allows resolution of jmethodIds
    void resolveMethod(ResolvableMethod method) {...}
}
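
To make the intended usage concrete, here is a minimal consumer sketch against the API proposed above. Everything here is hypothetical (none of these classes exist yet); the getInstance() accessor, the background thread and the 10ms back-off are my own assumptions.

class SamplePoller implements Runnable {
    private volatile boolean running = true;

    @Override
    public void run() {
        AsyncProfiler profiler = AsyncProfiler.getInstance(); // assumed accessor
        ProfilerSample sample = new ProfilerSample();         // reused to avoid allocation per event
        while (running) {
            // drain everything that is currently buffered
            while (profiler.pollSample(sample)) {
                process(sample);
            }
            try {
                Thread.sleep(10); // back off while no samples are available
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void process(ProfilerSample sample) {
        // resolve jmethodIds (with caching) and correlate with application events here
    }

    void stop() {
        running = false;
    }
}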

The goal is to allow the profiler to run continuously for an indefinite amount of time.
The Java-API consumer is responsible for pulling out the data fast enough, so that no data loss occurs.

I'd implement this by having a signal-safe, preallocated pool of native counterparts to the ProfilerSample introduced above.
This can be implemented relatively easily using singly linked lists for free and populated samples, where the heads are managed using atomics. When the Java-API pollSample is invoked, a populated sample is copied to the output and returned to the free list of the pool.
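
To illustrate the intended list management, here is a rough Java model. The real pool would be preallocated native memory manipulated with async-signal-safe atomics; NativeSample and its fields are placeholders, and a native version would additionally have to guard against the ABA problem when popping.

import java.util.concurrent.atomic.AtomicReference;

class SamplePool {
    static final class NativeSample {
        NativeSample next; // intrusive link for the free/populated lists
        long timestamp;
        long nativeThreadId;
        long[] stackTrace;
    }

    private final AtomicReference<NativeSample> freeHead = new AtomicReference<>();
    private final AtomicReference<NativeSample> populatedHead = new AtomicReference<>();

    SamplePool(int capacity) {
        for (int i = 0; i < capacity; i++) {
            push(freeHead, new NativeSample()); // preallocate everything up front
        }
    }

    // producer side (signal handler): take a free slot, fill it, publish it
    boolean offer(long timestamp, long tid, long[] frames) {
        NativeSample s = pop(freeHead);
        if (s == null) {
            return false; // pool exhausted -> caller increments the error counter instead
        }
        s.timestamp = timestamp;
        s.nativeThreadId = tid;
        s.stackTrace = frames;
        push(populatedHead, s);
        return true;
    }

    // consumer side (Java-API pollSample): copy one populated slot out and recycle it
    boolean poll(ProfilerSample out) {
        NativeSample s = pop(populatedHead);
        if (s == null) {
            return false;
        }
        out.timestamp = s.timestamp;
        out.nativeThreadId = s.nativeThreadId;
        out.stackTrace = s.stackTrace.clone();
        out.stackDepth = out.stackTrace.length;
        push(freeHead, s);
        return true;
    }

    private static void push(AtomicReference<NativeSample> head, NativeSample s) {
        NativeSample h;
        do {
            h = head.get();
            s.next = h;
        } while (!head.compareAndSet(h, s));
    }

    private static NativeSample pop(AtomicReference<NativeSample> head) {
        NativeSample h;
        do {
            h = head.get();
            if (h == null) {
                return null;
            }
        } while (!head.compareAndSet(h, h.next));
        return h;
    }
}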

If the profiler is not able to put a sample into the pool, because the Java-API consumer didn't pull out the data in a timely fashion,
an error counter shall be incremented instead.
For a start, the Java API would only support pulling out events of type EXECUTION_SAMPLE; everything else will simply be dropped.

After reading through the sources a bit more, I think I should be able to implement this myself and open a PR.

  • Would you be willing to accept a PR with this feature?
  • Is there any other data structure I'm overlooking, in addition to CallTraceStorage, which would grow over time and therefore prevent continuous profiling?
  • Is there any portable way to convert TSC::ticks() to a System.nanoTime() timestamp? We would need this to correctly order samples with application events (spans in my particular use case).
@apangin
Collaborator

apangin commented Apr 7, 2024

Thank you for the proposal. The idea generally makes total sense, but I have questions regarding the proposed design.

We would like to add a new output-type for at least wall-clock profiling

Why would a new output type depend on the profiling mode? I understand that for your use case, you are interested specifically in the wall-clock profiling, but if we are speaking about a new async-profiler feature supported in upstream, I think it can be generalized.

The goal is to allow the profiler to run continuously for an indefinite amount of time.

This is a good goal indeed. In my opinion, it is not a reason for a new output format, though. The JFR format is designed with continuous profiling in mind, and it is widely used for this purpose. If you have concerns about continuous profiling, I'd suggest extracting them into separate issues and solving them independently from the Java API work.

The Java-API consumer is responsible for pulling out the data fast enough, so that no data loss occurs.

No matter how fast it is, a Java consumer is fundamentally unable to cope with the flow of profiling events. For example, when a stop-the-world GC is running, the Java consumer thread is stopped and cannot process events at all. To avoid data loss, having a growable buffer is essential. CallTraceStorage is in fact such a buffer, which is also memory efficient in the sense that it deduplicates stack traces. Inventing yet another buffer, which, moreover, will not be that efficient, does not seem like the right approach to me. Limiting the growth of CallTraceStorage is a separate problem, which is definitely solvable.

I'd propose for simplicity to design the API in a polling fashion

A common question about polling-style API is what to do if an event is not available.

long[] stackTrace; //stacktrace in form of jmethod-IDs

One long is not enough to describe a frame. Frames have different types; a user may also be interested in the exact location (line number and/or bytecode index). Besides that, I would not limit the API to Java frames only: the ability to show native and JVM frames is the unique strength of async-profiler.

Would you be willing to accept a PR with this feature?

Before working on a PR, I suggest preparing a design doc for discussion. Based on the above comments, the proposed approach does not necessarily fit well into the current architecture of the profiler.

Is there any other data structure I'm overlooking, in addition to CallTraceStorage, which would grow over time and therefore prevent continuous profiling?

A growing data structure is not a problem per se, as long as there is a way to flush or trim the structure periodically. This function gives an idea of what takes up the profiler's memory. Besides call traces, this includes various dictionaries: classes, methods, symbols. You'll likely need dictionaries anyway, since resolving each method one by one without caching is extremely inefficient.
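
On the consumer side, such caching could be as simple as a map keyed by jmethodId around the resolveMethod call from the proposal above. This is purely illustrative; ResolvedMethod is just one possible implementation of the proposed ResolvableMethod interface, and getInstance() is an assumed accessor.

import java.util.HashMap;
import java.util.Map;

class MethodCache {
    static final class ResolvedMethod implements ResolvableMethod {
        final long jmethodId;
        Class<?> declaringClass;
        String methodName;
        String signature;

        ResolvedMethod(long jmethodId) {
            this.jmethodId = jmethodId;
        }

        public long getJmethodId() { return jmethodId; }
        public void setDeclaringClass(Class<?> clazz) { this.declaringClass = clazz; }
        public void setMethodName(String name) { this.methodName = name; }
        public void setSignature(String signature) { this.signature = signature; }
    }

    private final Map<Long, ResolvedMethod> cache = new HashMap<>();
    private final AsyncProfiler profiler = AsyncProfiler.getInstance();

    ResolvedMethod resolve(long jmethodId) {
        // one native resolveMethod call per distinct jmethodId; repeated lookups hit the map
        return cache.computeIfAbsent(jmethodId, id -> {
            ResolvedMethod m = new ResolvedMethod(id);
            profiler.resolveMethod(m);
            return m;
        });
    }
}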

Is there any portable way to convert TSC::ticks() to a System.nanoTime() timestamp?

With the clock=monotonic option, TSC::ticks() uses the same clock as System.nanoTime().

@JonasKunz
Author

JonasKunz commented Apr 8, 2024

Thank you for your feedback! Your remarks make total sense to me.

We would like to add a new output-type for at least wall-clock profiling

Why would a new output type depend on the profiling mode? I understand that for your use case, you are interested specifically in the wall-clock profiling, but if we are speaking about a new async-profiler feature supported in upstream, I think it can be generalized.

I was trying to limit the scope of the feature to not make the resulting PR overly big. What about limiting the scope to only supporting events of type PERF_SAMPLE and EXECUTION_SAMPLE instead?

The goal is to allow the profiler to run continuously for an indefinite amount of time.

This is a good goal indeed. In my opinion, it is not a reason for a new output format, though. The JFR format is designed with continuous profiling in mind, and it is widely used for this purpose. If you have concerns about continuous profiling, I'd suggest extracting them into separate issues and solving them independently from the Java API work.

Makes sense; let's keep the problem of growing data structures and running the profiler continuously out of this issue.

I'd propose for simplicity to design the API in a polling fashion

A common question about polling-style API is what to do if an event is not available.

Sorry, I was imprecise here: I mean a non-blocking, polling fashion.
If no event is available, this is indicated by the return value and the method returns immediately.

The Java-API consumer is responsible for pulling out the data fast enough, so that no data loss occurs.

No matter how fast it is, a Java consumer is fundamentally unable to cope with the flow of profiling events. For example, when a stop-the-world GC is running, the Java consumer thread is stopped and cannot process events at all. To avoid data loss, having a growable buffer is essential. CallTraceStorage is in fact such a buffer, which is also memory efficient in the sense that it deduplicates stack traces. Inventing yet another buffer, which, moreover, will not be that efficient, does not seem like the right approach to me. Limiting the growth of CallTraceStorage is a separate problem, which is definitely solvable.

By leaving aside the problem of continuous profiling we can simplify the approach by reusing your existing data structures.
I'd propose the following:

class BufferedEvent {
    BufferedEvent* next; // pointer for singly linked list management
    EventType type;
    int tid;             // native thread ID on which the event was recorded
    u32 call_trace_id;   // reference to the call trace stored in CallTraceStorage
};

// for now only ExecutionEvents are supported
class BufferedExecutionEvent : public BufferedEvent {
    ExecutionEvent event;
};

class ApiEventBuffer {
    LinearAllocator alloc; // used to allocate BufferedEvents on demand

    BufferedEvent* signalHandlerOutput; // head of singly linked list
    BufferedEvent* pollerBuffer;        // head of singly linked list
};

When the output mode is set to java-api, the profiler will output the events from the signal handler into the ApiEventBuffer:

  • It allocates a matching BufferedEvent for the Event from the LinearAllocator of the ApiEventBuffer
  • The resulting event references the call trace stored in the CallTraceStorage
  • The BufferedEvent is prepended to the ApiEventBuffer->signalHandlerOutput singly linked list using an atomic CAS loop

The polling from the Java API would work as follows:

  • The polling function would be synchronized: only one thread can fetch an event at a time
  • The poller first checks whether ApiEventBuffer->pollerBuffer is not null
    • If it is not null, the head of ApiEventBuffer->pollerBuffer is removed and used to populate the Java data structure
    • If ApiEventBuffer->pollerBuffer is null, it is repopulated from ApiEventBuffer->signalHandlerOutput
      • ApiEventBuffer->signalHandlerOutput is reversed (to return the events in the correct order) and moved to ApiEventBuffer->pollerBuffer

With this design, the memory occupied by ApiEventBuffer will grow over time. This means that the profiler periodically needs to be stopped and restarted to clear the memory.
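
A Java model of the prepend and drain-and-reverse scheme described above, just to illustrate the intent; the real ApiEventBuffer would be native C++ backed by a LinearAllocator, and BufferedEvent here is a simplified stand-in.

import java.util.concurrent.atomic.AtomicReference;

class ApiEventBufferModel {
    static final class BufferedEvent {
        BufferedEvent next;  // singly linked list pointer
        int tid;             // native thread id
        int callTraceId;     // reference into CallTraceStorage
    }

    private final AtomicReference<BufferedEvent> signalHandlerOutput = new AtomicReference<>();
    private BufferedEvent pollerBuffer; // only touched by the (synchronized) poller

    // producer side: prepend with a CAS loop, newest event first
    void publish(BufferedEvent e) {
        BufferedEvent head;
        do {
            head = signalHandlerOutput.get();
            e.next = head;
        } while (!signalHandlerOutput.compareAndSet(head, e));
    }

    // consumer side: only one thread may poll at a time
    synchronized BufferedEvent poll() {
        if (pollerBuffer == null) {
            // grab the whole producer list in one atomic swap, then reverse it
            // so events come out in the order they were recorded
            BufferedEvent grabbed = signalHandlerOutput.getAndSet(null);
            BufferedEvent reversed = null;
            while (grabbed != null) {
                BufferedEvent next = grabbed.next;
                grabbed.next = reversed;
                reversed = grabbed;
                grabbed = next;
            }
            pollerBuffer = reversed;
        }
        BufferedEvent head = pollerBuffer;
        if (head != null) {
            pollerBuffer = head.next;
        }
        return head; // null means "no event available"
    }
}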

Some clarifications for your additional remarks:

One long is not enough to describe a frame. Frames have different types; a user may also be interested in the exact location (line number and/or bytecode index). Besides that, I would not limit the API to Java frames only: the ability to show native and JVM frames is the unique strength of async-profiler.

I would propose to add an int[] bci array in addition to long[] stackTrace, with special values for the BCI in case of non-Java frames, just as it is done natively within async-profiler.
For native frames, IIUC, the jmethodId points to a const char* in the dictionary representing the native method name. I think we should copy those to a Java CharBuffer/ByteBuffer member of the ProfilerSample Java class eagerly during polling and have the jmethodId represent an offset into that buffer: this way we don't risk intermediate profiler restarts leaving the Java ProfilerSample with dangling references.
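
Put together, the sample class could then look roughly like this; the field names, the ByteBuffer encoding and the BCI markers are illustrative only, not a finished design.

import java.nio.ByteBuffer;

class ProfilerSample {
    long timestamp;              // System.nanoTime()-compatible timestamp
    long nativeThreadId;         // OS thread ID on which the sample was taken
    int stackDepth;
    long[] stackTrace;           // jmethodId for Java frames, offset into nativeFrameNames otherwise
    int[] bci;                   // bytecode index, or a special negative value marking the frame type
    ByteBuffer nativeFrameNames; // names of native/JVM frames, copied eagerly during polling
}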

However, I think the details of the Java API surface are best discussed against a concrete prototype implementation.

Before working on a PR, I suggest preparing a design doc for discussion. Based on the above comments, the proposed approach does not necessarily fit well into the current architecture of the profiler.

With my revised approach, do you still feel the need for a separate doc before creating a prototype PR?
As I said above, I feel the Java API surface is easiest to discuss with a concrete prototype implementation.
I don't mind doing several iterations on the PR; I just feel it's easier to ensure I don't overlook anything with a prototype than with just a design concept.

@apangin
Collaborator

apangin commented Apr 15, 2024

Thank you for the update.
Feel free to outline the proposal in whatever form you like; if you think it's better to express it in the form of a PR, that's fine. I just wanted to save you from possibly going too far into the implementation of a design that may not fit well in the existing architecture. Also, if the PR is too big, I'll likely ask for a high-level overview anyway; otherwise it may be difficult to review the code.

At first glance, the concept of BufferedEvent and ApiEventBuffer looks controversial. This seems to be a new structure for buffering events which requires writing plenty of C++ code, especially if we intend to support multiple event types. In the meantime, there is already a structure exactly for storing events - the JFR buffer. It already supports all types of events and serializes them in a compact format that can be easily processed both in C++ and Java. Currently, JFR buffers are flushed to a file, but it should be easy to store data in memory instead. Or, even better, an existing recorder may be reused to write data to a pipe-like file descriptor, so that a Java thread may consume the opposite end of the pipe. As a bonus, the Java side will automatically benefit from all supported ways of doing I/O: blocking reads, polling, or async I/O with callbacks. This approach is also suitable for streaming events over the wire.
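
For illustration, the Java side of such a pipe could be as simple as a blocking read loop. This sketch assumes a hypothetical mode in which the recorder writes to a named pipe (FIFO) at a known path; no such output mode exists today, and the record handling is left open.

import java.io.FileInputStream;
import java.io.IOException;

class PipeConsumer implements Runnable {
    private final String fifoPath; // a FIFO created for the recorder (hypothetical)

    PipeConsumer(String fifoPath) {
        this.fifoPath = fifoPath;
    }

    @Override
    public void run() {
        byte[] buf = new byte[64 * 1024];
        try (FileInputStream in = new FileInputStream(fifoPath)) {
            int n;
            while ((n = in.read(buf)) >= 0) { // blocks until the recorder flushes data
                // hand buf[0..n) to a JFR-record parser, or copy it into a larger buffer
            }
        } catch (IOException e) {
            // the writer side closed the pipe, e.g. because the profiler was stopped
        }
    }
}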

This is just the first idea that came to my mind; I haven't considered it thoroughly yet. The key point here is that a "proper" design should be 1) generic enough; 2) simple to implement with few changes to the existing architecture; 3) introduce a minimal amount of new native APIs.

@JonasKunz
Author

Or, even better, an existing recorder may be reused to write data to a pipe-like file descriptor, so that a Java thread may consume the opposite end of the pipe. As a bonus, the Java side will automatically benefit from all supported ways of doing I/O: blocking reads, polling, or async I/O with callbacks. This approach is also suitable for streaming events over the wire.

Sounds like a good idea to me, though I'm not sure this actually lowers the amount of C++ code required.
The main difficulty I see is that the way the metadata is structured (e.g. event structure, stack traces, class names, etc.) is not really suited for low-latency streaming or for streaming over a medium with a small buffer (e.g. pipes).
IINM, if we wanted to stream the metadata, we would need to wait for a JFR chunk to be finished before being able to actually write it to the pipe. This would incentivize having very small chunks, which would become very inefficient due to repeating the same stack traces, class names, etc. over and over again.

Please correct me here if I'm wrong; I'm definitely not an expert in the JFR file structure.

I'd suggest that in this "pipe" mode, the flight recorder output just sends the events out without the metadata.
E.g. the reader of the pipe would receive the execution events with the stack traces represented just by the u32 id. A separate Java API call would be needed to actually resolve the stack trace. Therefore this wouldn't really be suited for sending directly over the wire or to a consumer in another process.

Is the binary structure of the events stable, or will new fields only be added at the end? If not, we'll also need to somehow expose the structure via the Java API.

I initially thought of using Unix domain sockets in datagram mode and immediately writing events to them from the signal handler as they occur.
However, I discarded this idea because one write per event does not seem like a good idea performance-wise.

Therefore, if we go the route of streaming binary JFR events, I think we should go with actual POSIX pipes, especially because their capacity can be configured.
Because pipes have a limited capacity, we need to make sure we know what to do when the pipe gets full:

  • We need to take care not to corrupt the byte stream and confuse readers. This means we need to avoid truncation when sending data, so we should use the pipe in O_DIRECT mode and make sure our writes are smaller than PIPE_BUF.
  • Therefore, FlightRecorder::flushIfNeeded would need to be called with a size of (PIPE_BUF - maximum event size) after an event is recorded
  • If we can't flush some events because the pipe is full, we need to drop them

To provide reasonably low latency, I think FlightRecorder::timerTick would also need to attempt to flush the buffers and be invoked more frequently (e.g. once every 10ms instead of once per second).

Overall, I think this approach would be much closer to #404, and I assume it would be sufficient for many people who requested this feature. However, I think it is also going to require quite some changes in the FlightRecorder class, e.g. to avoid sending metadata and to respect the PIPE_BUF boundaries.

This seems to be a new structure for buffering events which requires writing plenty of C++ code, especially if we intend to support multiple event types. In the meantime, there is already a structure exactly for storing events - the JFR buffer.

This data structure was intended as a thin wrapper around the existing events from event.h, so it actually shouldn't be too much code. Also, I think it wouldn't introduce another place one needs to touch if, e.g., a field is added to an event, because that would happen in event.h. It would, however, be required to add that field to the Java representation of that event type.

@JonasKunz
Author

To pick up my last point from my previous comment regarding the implementation complexity of buffering events:
I've provided a pretty much complete implementation of this data structure in this draft PR.

If required, this data structure can be made to support blocking poll operations relatively easily using sem_wait/sem_post, I think.
This would make it quite useful in general as a middleman for any kind of output, IMO: e.g. at the moment the JFR output is coded in an async-signal-safe way. This wouldn't be required if the JFR output were a thread consuming events from my proposed EventBuffer. Not that this matters for the current JFR output implementation (because the work has already been done), but maybe for future output options, such as native OpenTelemetry profiling data (OTLP).

If the growing memory due to the LinearAllocator is a concern, that can be solved too by using two LinearAllocators: one is the "active" one used for new events, the other is "inactive". Once a certain number of allocations is reached, the two are swapped. The inactive one can then be cleared once all events allocated from it have been consumed.

@apangin
Collaborator

apangin commented Apr 18, 2024

if we wanted to stream the metadata, we would need to wait for a JFR chunk to be finished before being able to actually write it to the pipe. This would incentivize having very small chunks, which would become very inefficient due to repeating the same stack traces, class names, etc. over and over again.

Not really. There is no problem in sending events as they come: metadata for parsing JFR records is written at the beginning of the chunk, and each record is prepended with the size header. Dictionaries (constant pools) required for resolution of numeric references may interleave with the events, and there can be as many incremental constant pools in a chunk as you want.

Anyhow, the key question is where this low-latency requirement comes from. 10ms latency seems unreasonable to me. Such latency cannot be guaranteed not only at the JVM level (considering JVM safepoint pauses), but even at the OS level. I'd say any design that does not tolerate event processing latency on the order of seconds is doomed to failure.

So, before implementing the proposed feature, I'd suggest jumping back to the original use case and starting the design from the consumer end.

apangin closed this as completed Apr 18, 2024
apangin reopened this Apr 18, 2024
@apangin
Collaborator

apangin commented Apr 18, 2024

(sorry, closed by mistake)

@apangin
Collaborator

apangin commented Apr 18, 2024

To add to the previous point, I foresee the inevitable need for post-processing of profiling data anyway (as opposed to on-the-go processing). As an example, I'm currently working on a feature that should reduce the overhead of wall-clock profiling by an order of magnitude, both in recording size and CPU consumption. As a result of this optimization, no wall-clock events will be emitted for idle threads until they wake up or a timeout elapses (where the timeout is measured in seconds). With this in mind, a system that intends to process execution samples should not rely on events arriving immediately for each observed thread, and no assumptions should be made about the order in which these events appear.

@JonasKunz
Author

Anyhow, the key question is where this low-latency requirement comes from. 10ms latency seems unreasonable to me. Such latency cannot be guaranteed not only at the JVM level (considering JVM safepoint pauses), but even at the OS level. I'd say any design that does not tolerate event processing latency on the order of seconds is doomed to failure.

The original use case motivating this feature for me was the complexity currently introduced by the delayed processing of the profiling data for the "inferred-spans" feature:

One of the main pain points of this implementation is that at the moment the processing needs to happen after the profiling session has ended:

  • When a span is started, a profiling session with a fixed duration is started if none is running already
  • For the duration of this session, the extension also needs to spill to disk a log of which span was active on which thread at what time
  • After the profiling session is over, both the profiling data and the span log are read back and the synthetic spans are reconstructed

The idea behind my proposed EventBuffer data structure was not to guarantee a hard latency limit.
Instead, this implementation would guarantee that a thread polling events from the EventBuffer sees the profiler samples which occurred on that same thread prior to invoking EventBuffer.poll. For the inferred-spans feature, this would yield the following benefits:

  • No need to spill the log of when spans were active on threads to disk, because events can be processed as they occur (or rather, all profiler samples which occurred between a span's start and end can be processed when the span ends)
  • It is possible to detect whether a profiler sample occurred while a span was active. This enables us to collect a stack trace at span.end() so that we are able to correctly correlate the stack frames without having to guess which one corresponds to the span.

Of course, one would still need to take into account potential data races if multiple threads pull events at the same time, but that could be solved by the API consumer.
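
As an illustration of how the inferred-spans use case could consume this: on span end, drain the buffered samples and keep those that fall on the span's thread inside its time window. The Span type, its accessors, getInstance() and the copy step are all hypothetical.

import java.util.ArrayList;
import java.util.List;

class InferredSpansProcessor {
    private final AsyncProfiler profiler = AsyncProfiler.getInstance();

    // called on the thread that ends the span
    List<ProfilerSample> onSpanEnd(Span span) {
        long ownTid = profiler.getNativeThreadId(Thread.currentThread());
        List<ProfilerSample> withinSpan = new ArrayList<>();
        ProfilerSample sample = new ProfilerSample();
        // per the invariant above, samples taken on this thread before this call are visible here
        while (profiler.pollSample(sample)) {
            if (sample.nativeThreadId == ownTid
                    && sample.timestamp >= span.startNanos()
                    && sample.timestamp <= span.endNanos()) {
                withinSpan.add(copyOf(sample));
            }
            // samples from other threads/spans would be routed to their own span state instead
        }
        return withinSpan;
    }

    private static ProfilerSample copyOf(ProfilerSample s) {
        ProfilerSample copy = new ProfilerSample();
        copy.timestamp = s.timestamp;
        copy.nativeThreadId = s.nativeThreadId;
        copy.stackDepth = s.stackDepth;
        copy.stackTrace = s.stackTrace.clone();
        return copy;
    }
}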

To add to the previous point, I foresee the inevitable need for post-processing of profiling data anyway (as opposed to on-the-go processing). As an example, I'm currently working on a feature that should reduce the overhead of wall-clock profiling by an order of magnitude, both in recording size and CPU consumption. As a result of this optimization, no wall-clock events will be emitted for idle threads until they wake up or a timeout elapses (where the timeout is measured in seconds).

Great addition! I guess this very same mechanism could also be used for virtual threads some time in the future?
To my understanding, wouldn't this just be an execution sample with a different weight (= the duration for which the thread was asleep)? And the invariant I explained above (threads can immediately see their own events) would still hold, IIUC.

I would not expect the Java API to be stable: if the internal event structure of async-profiler changed, the Java API consumers would likely have to change too due to the changed semantics of events. IMO, API consumers would need to accept that and adapt to changes when upgrading.

Not really. There is no problem in sending events as they come: metadata for parsing JFR records is written at the beginning of the chunk, and each record is prepended with the size header. Dictionaries (constant pools) required for resolution of numeric references may interleave with the events, and there can be as many incremental constant pools in a chunk as you want.

Thanks for explaining this. I'm definitely not that familiar with the JFR format, and AFAIK it also isn't really specified publicly. Anyhow, I don't think I have the capabilities and capacity to implement the JFR event streaming you proposed to an output with limited capacity (e.g. a pipe) directly from a signal handler, due to the potential problem of either truncation or blocking writes when the pipe gets full. Also, this wouldn't be very helpful for my original use case (inferred spans), because we'd still have to spill span activations/deactivations to disk and correlate them with the profiling data later.

@apangin
Collaborator

apangin commented Apr 22, 2024

As mentioned above, async-profiler already has a structure for storing events. It's OK to extend it to support more use cases. What I am concerned about is adding a new EventBuffer that lives in parallel with existing structures and partially mirrors their functionality. Even though this may look easy to implement, the additional structure with the corresponding API will have a permanent maintenance cost, while being used primarily by one customer. What I propose instead is to provide a generic API which can potentially be used in other scenarios and which the project maintainers will be happy to support.

to an output with limited capacity (e.g. a pipe) directly from a signal handler, due to the potential problem of either truncation or blocking writes when the pipe gets full

I don't see a fundamental difference between the queue you've proposed and a pipe in this sense. You have to bound the queue anyway to prevent unlimited growth, and therefore you will face the same problem of truncation or waiting. Meanwhile, the default pipe capacity on Linux is 64 KB, which is enough to store 4000 events (that's a lot!)

@JonasKunz
Author

Even though this may look easy to implement, the additional structure with the corresponding API will have a permanent maintenance cost, while being used primarily by one customer. What I propose instead is to provide a generic API which can potentially be used in other scenarios and which the project maintainers will be happy to support.

I see your point and understand it. If you don't see any use cases other than our admittedly rather special one and feel that the maintenance cost would be too high to justify this change, it makes sense for you to reject this proposal. If we (Elastic) need it that much, we can always maintain a fork and pay the maintenance cost ourselves.

Anyway, if this changes over time and new use cases pop up, don't hesitate to reach out to me and I'll be happy to contribute.

I don't see a fundamental difference between the queue you've proposed and a pipe in this sense. You have to bound the queue anyway to prevent unlimited growth, and therefore you will face the same problem of truncation or waiting. Meanwhile, the default pipe capacity on Linux is 64 KB, which is enough to store 4000 events (that's a lot!)

The difference I'm seeing is the level at which the queue operates versus a pipe: the queue operates on events, pipes operate on bytes. So the difficulties I'm thinking of are, for example:

  • When writing an event, you need to anticipate whether that write would block or truncate the event bytes and, based on that, possibly not write the event at all.
  • What about the symbols (e.g. stack traces) being written when finalizing a JFR chunk? Here I assume a blocking write from the existing background timer thread would need to be used, so that we don't end up writing half chunks. At the same time, one must ensure that this doesn't interfere with the signal handlers trying to write events.
  • Even though 64 KB is a lot, as you mentioned earlier, you cannot trust that the consumer is eager to consume the data and isn't starved. I think the code needs to be written assuming that the consumer misbehaves and doesn't read any data at all.

Nonetheless, thank you for taking the time to discuss this! I appreciate it.

@Artyomcool
Contributor

@JonasKunz a few years ago we played a little with a JFR output for async-profiler that makes the metadata append-only. It could be the easiest way to solve your issue. In that case you can make the chunks small enough for your needs and transfer them via pipes, as @apangin suggests. It still has some technical challenges, but they are very solvable.
Parsing (or pre-parsing) of such pipe output can even be done inside the Java process.
