Built-in tracing mega issue #26

Closed
shuhaowu opened this issue Jan 26, 2023 · 1 comment · Fixed by #36

shuhaowu commented Jan 26, 2023

Tracing is an important aspect of developing real-time applications, as it allows the developer to identify long-running code blocks. It involves two components: a real-time trace collection system and an offline trace analysis/visualization system. The idea is to integrate trace collection into cactus_rt such that the program is automatically traced during development (either for the entire duration of the run, or started/stopped dynamically via an external signal). The cactus_rt framework should also allow the program to be traced during production runs should the user opt to do so. If the performance impact of trace event emission is low and the number of emissions is kept to a reasonable number, there's no reason why tracing can't be done continuously while the program is running, to gain better insight into the program under production conditions.

A trace analysis system that includes Gantt-chart-style visualization should be available for the tracing data. Support for more complex analysis, such as querying the trace with SQL, would also be useful.

A bonus feature would be to pass log messages out of the RT thread so that they can be formatted and printed in a separate thread/process.
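One possible shape for this bonus feature is a fixed-capacity single-producer/single-consumer ring buffer: the RT thread pushes small, unformatted records and a non-RT thread pops and formats them. The sketch below is only an illustration under that assumption. `LogRecord` and `SpscLogQueue` are hypothetical names and are not part of cactus_rt. It requires C++17.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical fixed-size record: the RT thread stores only a pointer to a
// format string and a raw value; formatting happens on the consumer side.
struct LogRecord {
  const char* format;  // must point to static storage (e.g. a string literal)
  std::int64_t value;
};

// Minimal single-producer/single-consumer ring buffer (illustrative only).
// Capacity must be a power of two so that index wrapping is a cheap mask.
template <std::size_t Capacity>
class SpscLogQueue {
  static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                "Capacity must be a non-zero power of two");

 public:
  // Called from the RT thread only. Never blocks or allocates; returns false
  // (the message is dropped) when the queue is full.
  bool TryPush(const LogRecord& record) noexcept {
    assert(record.format != nullptr);
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == Capacity) return false;
    slots_[head & (Capacity - 1)] = record;
    head_.store(head + 1, std::memory_order_release);
    return true;
  }

  // Called from the non-RT logging thread only. Returns an empty optional
  // when there is nothing to read.
  std::optional<LogRecord> TryPop() noexcept {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (tail == head) return std::nullopt;
    const LogRecord record = slots_[tail & (Capacity - 1)];
    tail_.store(tail + 1, std::memory_order_release);
    return record;
  }

 private:
  std::array<LogRecord, Capacity> slots_{};
  std::atomic<std::size_t> head_{0};  // next slot to write (producer-owned)
  std::atomic<std::size_t> tail_{0};  // next slot to read (consumer-owned)
};
```

The RT side would call `TryPush` and accept that messages are dropped when the queue is full; the logging thread would poll `TryPop` and do the `printf`-style formatting there, so no formatting or I/O happens in the RT thread.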

Perfetto

Perfetto is a Google-developed tracing tool with three major components: (1) the tracing SDK, (2) the trace processor, and (3) the trace visualizer. The tracing SDK enables application-specific traces by passing the trace data quickly out of the application process into a tracing service, which can then record the data into a file. It also has the ability to record the data directly in process, via a separate thread. The trace processor allows users to run SQL queries on an existing trace file, which can simplify the trace analysis. The trace visualizer is a web UI that allows for visualization of the trace data in a Gantt-chart-style view, as well as providing a web UI for interactive SQL execution.

On paper, this checks all the boxes. My understanding of how it works is as follows, based on this document:

  1. When a trace event is emitted, the SDK grabs a free page in a shared memory buffer and serializes the protobuf-encoded message into it (via a specialized protozero library that has very low overhead).
  2. An async IPC gets sent to the tracing service, which instructs the tracing service to copy the shared memory buffer into its own buffer (the central buffer) and mark the shared memory buffer as free again for reuse.
  3. From the central buffer, the data is written to disk either periodically or at the end of the program, depending on the configuration.

However, a careful reading of the documentation and a quick look through the code base show that the emission of trace events is not real-time safe. Specifically, the documentation states:

At some point one of the set_int_val() calls will hit the slow-path and acquire a new buffer. The overall idea is having a serialization mechanism that is extremely lightweight most of the times and that requires some extra function calls when buffer boundary, so that their [time] cost gets amortized across all trace events.

In the context of the overall Perfetto tracing use case, the slow-path involves grabbing a process-local mutex and finding the next free chunk in the shared memory buffer. Hence writes are lock-free as long as they happen within the thread-local chunk and require a critical section to acquire a new chunk once every 4KB-32KB (depending on the trace configuration).

My understanding is that this occurs during the shared memory buffer write. If a trace event is emitted from the RT thread at the same time as from a non-RT thread and the slow path is triggered (because the trace packet crosses the buffer boundary), the RT thread may have to wait for a mutex held by the non-RT thread. This is a priority inversion problem, which can result in unbounded latency. Further, the documentation suggests that memory allocation occurs in the slow path (not 100% sure on this though), which can also cause problems for real-time.
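To make the concern concrete, here is a toy model of the fast-path/slow-path split described in the quoted documentation. It is not Perfetto's code: `SharedBuffer`, `ChunkWriter`, and `kChunkSize` are hypothetical names, and the model simply starts a new chunk when an event does not fit rather than reproducing how Perfetto handles packets that cross a chunk boundary. It only shows where the lock sits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <vector>

// Toy model only; this is NOT Perfetto's implementation. All names are
// hypothetical and chosen to mirror the terms in the quoted documentation.
constexpr std::size_t kChunkSize = 4096;  // assumed: low end of the 4KB-32KB range

struct Chunk {
  unsigned char bytes[kChunkSize];
  std::size_t used = 0;
};

class SharedBuffer {
 public:
  explicit SharedBuffer(std::size_t num_chunks) : chunks_(num_chunks) {}

  // Slow path: takes a process-local mutex. If a non-RT thread holds it and
  // gets preempted, an RT caller blocks here with no upper bound on the wait
  // (std::mutex does not guarantee priority inheritance).
  Chunk* AcquireChunk() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++slow_path_calls_;
    if (next_free_ == chunks_.size()) return nullptr;  // buffer exhausted
    return &chunks_[next_free_++];
  }

  // Unsynchronized read; only meant for single-threaded inspection.
  std::size_t slow_path_calls() const { return slow_path_calls_; }

 private:
  std::mutex mutex_;
  std::vector<Chunk> chunks_;
  std::size_t next_free_ = 0;
  std::size_t slow_path_calls_ = 0;
};

// One writer per thread; it owns its current chunk exclusively.
class ChunkWriter {
 public:
  explicit ChunkWriter(SharedBuffer& buffer) : buffer_(buffer) {}

  bool Write(const void* data, std::size_t size) {
    assert(size <= kChunkSize);
    if (chunk_ == nullptr || chunk_->used + size > kChunkSize) {
      // Slow path: the event does not fit, so a new chunk is needed.
      chunk_ = buffer_.AcquireChunk();
      if (chunk_ == nullptr) return false;
    }
    // Fast path: plain copy into the thread-local chunk, no lock taken.
    std::memcpy(chunk_->bytes + chunk_->used, data, size);
    chunk_->used += size;
    return true;
  }

 private:
  SharedBuffer& buffer_;
  Chunk* chunk_ = nullptr;
};
```

With 1 KiB events, only every fourth write in this model reaches `AcquireChunk`, which is why the cost looks small on average. For an RT thread the average is not what matters: a single contended call is enough to miss a deadline.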

Thus, Perfetto is not suitable for real-time production tracing. However, it's possible we can still use Perfetto to trace in development, and use a compile-time flag to disable tracing for release builds.
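A compile-time switch of this kind could look roughly like the following sketch. The flag `EXAMPLE_ENABLE_TRACING`, the macro `EXAMPLE_TRACE_SPAN`, and the counting `EmitSpan` sink are all hypothetical stand-ins; a real build would forward to whichever tracing backend is chosen. It requires C++17 for the inline variable.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical flag: a release build would pass -DEXAMPLE_ENABLE_TRACING=0.
// It defaults to enabled here so that the sketch does something.
#ifndef EXAMPLE_ENABLE_TRACING
#define EXAMPLE_ENABLE_TRACING 1
#endif

// Stand-in sink that only counts spans; a real build would hand the span to
// the tracing backend here.
inline int g_emitted_span_count = 0;
inline void EmitSpan(const char* /*name*/, std::int64_t /*start_ns*/,
                     std::int64_t /*end_ns*/) {
  ++g_emitted_span_count;
}

#if EXAMPLE_ENABLE_TRACING
// RAII helper: records the start time on construction and emits the span
// when the enclosing scope ends.
class ScopedSpan {
 public:
  explicit ScopedSpan(const char* name) : name_(name), start_ns_(NowNs()) {
    assert(name != nullptr);
  }
  ~ScopedSpan() { EmitSpan(name_, start_ns_, NowNs()); }
  ScopedSpan(const ScopedSpan&) = delete;
  ScopedSpan& operator=(const ScopedSpan&) = delete;

 private:
  static std::int64_t NowNs() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
  }
  const char* name_;
  std::int64_t start_ns_;
};

#define EXAMPLE_CONCAT_INNER(a, b) a##b
#define EXAMPLE_CONCAT(a, b) EXAMPLE_CONCAT_INNER(a, b)
#define EXAMPLE_TRACE_SPAN(name) \
  ScopedSpan EXAMPLE_CONCAT(example_span_, __LINE__)(name)
#else
// Tracing disabled: the macro expands to a no-op, so release builds carry no
// tracing code in the RT path.
#define EXAMPLE_TRACE_SPAN(name) ((void)0)
#endif
```

Placing `EXAMPLE_TRACE_SPAN("control_loop");` at the top of a scope traces that scope in development builds and compiles to nothing when the flag is set to 0.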

Even though the Perfetto tracing SDK is unusable in real-time code, we might still be able to use the trace processor and visualizer components if we can emit a Perfetto-compatible data file with a custom tracing solution, perhaps based on LTTng. Since the Perfetto trace processor also accepts the Chromium trace JSON format, we could perhaps emit that as well.
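As a rough idea of what emitting the Chromium trace JSON format could involve, the sketch below serializes already-collected spans as "complete" events (`"ph":"X"`), with `ts` and `dur` in microseconds. `SpanEvent`, `EscapeJson`, and `WriteChromeTrace` are hypothetical names, and the writer is meant to run outside the RT thread since it allocates and does stream I/O.

```cpp
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory representation of one finished span.
struct SpanEvent {
  std::string name;
  std::string category;
  std::uint64_t start_us;     // written as "ts"
  std::uint64_t duration_us;  // written as "dur"
  std::uint32_t pid;
  std::uint32_t tid;
};

// Minimal escaping: handles quotes and backslashes and replaces control
// characters with spaces. A real implementation would use a JSON library.
std::string EscapeJson(const std::string& text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    if (c == '"' || c == '\\') {
      out.push_back('\\');
      out.push_back(c);
    } else if (static_cast<unsigned char>(c) < 0x20) {
      out.push_back(' ');
    } else {
      out.push_back(c);
    }
  }
  return out;
}

// Writes all spans as one JSON object with a "traceEvents" array.
void WriteChromeTrace(std::ostream& out, const std::vector<SpanEvent>& events) {
  out << "{\"traceEvents\":[";
  for (std::size_t i = 0; i < events.size(); ++i) {
    const SpanEvent& e = events[i];
    if (i != 0) out << ",";
    out << "{\"name\":\"" << EscapeJson(e.name) << "\""
        << ",\"cat\":\"" << EscapeJson(e.category) << "\""
        << ",\"ph\":\"X\""
        << ",\"ts\":" << e.start_us
        << ",\"dur\":" << e.duration_us
        << ",\"pid\":" << e.pid
        << ",\"tid\":" << e.tid << "}";
  }
  out << "]}";
}
```

Writing the stream to a file should give something the trace UI can load, though whether this format is expressive enough for counters and other data is part of what still needs investigating.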

Also, the Perfetto tracing SDK can't pass log messages (by default), but it can emit counter information, which can be plotted in the UI.

LTTng

TBD.


shuhaowu commented Jan 28, 2023

Current todos

  • Investigate LTTng and have it trace spans and counters.
    • One issue is that LTTng-UST is based on a bunch of macros that clearly behave like macros (they take "symbol names" as arguments, as opposed to things like strings/integers/real values usable in C++). This means it'll be difficult to abstract this away for the user of the library.
    • The other issue is that there's no single-binary mode. This means you always need to run at least one more process for tracing. This is a usability problem for the user, as it raises the barrier to entry.
  • Investigate the performance aspects of LTTng and see if it is real-time safe.
  • Investigate how to output file formats accepted by Perfetto's trace processor and trace UI.
