Overlay execution info messages in timeline view #3501

Open
hamersaw opened this issue Mar 20, 2023 Discussed in #3429 · 1 comment
Labels
ui Admin console user interface

Comments

@hamersaw
Contributor

Discussed in #3429

Originally posted by hamersaw March 8, 2023

Motivation

The timeline UI view is marginally useful for debugging performance, but it has a lot of room for improvement. Integrating the runtime metrics breakdown proposed in the performance observability RFC is a step in the right direction: it partitions node executions into a collection of categorized time-series. This representation helps with the "what" but misses much of the "why". For example, if a particular execution shows a large amount of frontend plugin overhead, this means that Flyte started the Task but the backend service has not yet indicated that it has started. K8s gurus will quickly identify scheduling contention, large image pull times, or a few other likely scenarios as the cause. However, this information is not easily available to the user, even though FlytePropeller already has it. We currently store a single "reason" for the current execution status, but may be better off tracking a time-series of reasons to better explain the execution.

Proposal

This proposal outlines a solution for overlaying a collection of human-readable messages in the timeline view. The exact representation is VERY open for debate, but I envision something similar to Jaeger (time-series telemetry data with events), which uses a single tick mark that displays a message on hover. This solution supplies the "why", an explanation of the reported execution status, to complement the "what" in the runtime breakdown of the execution time-series. The goal is to balance utility with simplicity, displaying a "useful" number of messages to improve context.
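
As one possible reading of the tick-mark idea, the sketch below maps timestamped messages to fractional positions along a timeline bar. It is written in Python purely for illustration and is not FlyteConsole code; `tick_offsets`, its parameters, and the sample messages are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def tick_offsets(
    start: datetime,
    end: datetime,
    reasons: List[Tuple[datetime, str]],
) -> List[Tuple[float, str]]:
    """Map each (timestamp, message) pair to a fraction in [0, 1] along the bar."""
    total = (end - start).total_seconds()
    if total <= 0:
        # Degenerate bar: pin every tick to the left edge.
        return [(0.0, message) for _, message in reasons]
    ticks = []
    for occurred_at, message in reasons:
        fraction = (occurred_at - start).total_seconds() / total
        # Clamp so events reported slightly outside the bar stay visible.
        ticks.append((min(max(fraction, 0.0), 1.0), message))
    return ticks


# Made-up sample data: a 100-second execution with two messages.
start = datetime(2023, 3, 8, 12, 0, 0, tzinfo=timezone.utc)
end = start + timedelta(seconds=100)
ticks = tick_offsets(start, end, [
    (start + timedelta(seconds=25), "pulling image"),
    (start + timedelta(seconds=50), "container started"),
])
```

A UI could place one tick at each fraction and show the message on hover; how many messages to display is left to the visualization discussion.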

Implementation

Currently, FlyteAdmin maintains a single "reason" within the task execution metadata. This is updated in place on each event from FlytePropeller, meaning the old "reasons" are not persisted. At the risk of oversimplifying, we will need to transition to maintaining a collection of "reasons" with associated timestamps. This will require updates in the following repositories:

  • FlyteIDL: update TaskExecutionClosure to have repeated reasons with associated timestamps.
  • FlyteAdmin: use an append to the "reason" list rather than overwriting the existing singular "reason".
  • FlyteConsole: correctly parse the "reason" list to annotate the timeline UI view.
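
The difference between overwriting and appending can be sketched roughly as follows. This is a toy Python model, not actual FlyteIDL or FlyteAdmin code: the `Reason` type, the `reasons` and `occurred_at` field names, the `apply_event` helper, and the sample messages are all assumptions made for illustration. Skipping consecutive duplicate messages is also an assumption, not part of the proposal.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Reason:
    """Hypothetical timestamped reason; field names are assumptions."""
    occurred_at: datetime
    message: str


@dataclass
class TaskExecutionClosure:
    """Toy stand-in for the closure, holding a list instead of one reason."""
    reasons: List[Reason] = field(default_factory=list)


def apply_event(closure: TaskExecutionClosure, occurred_at: datetime, message: str) -> None:
    """Append the event's reason instead of overwriting the previous one."""
    # Assumption: skip consecutive duplicates so repeated events do not bloat the list.
    if closure.reasons and closure.reasons[-1].message == message:
        return
    closure.reasons.append(Reason(occurred_at, message))


# Made-up sample events.
closure = TaskExecutionClosure()
t0 = datetime(2023, 3, 8, 12, 0, 0, tzinfo=timezone.utc)
apply_event(closure, t0, "pod is pending: unschedulable")
apply_event(closure, t0.replace(second=30), "pulling image")
apply_event(closure, t0.replace(second=45), "pulling image")  # duplicate, ignored
apply_event(closure, t0.replace(minute=1), "container started")
```

After these calls the closure keeps the full history in order, which is what the console would need to annotate the timeline.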

Open Questions

  1. How should this be visualized? I will leave this discussion for more UI / UX oriented personnel.
  2. Should we add this information to node executions / workflow executions? Currently the "reason" is only tracked for the task-level execution.
  3. Do we need to be able to send multiple reasons in a single task event?
       • It is currently possible to skip phases if execution progresses before FlytePropeller detects and processes the intermediate stage.
       • We could use event buffers to just send multiple events, which is probably the better solution.
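
The event-buffer option in question 3 might look something like the sketch below. It is a hypothetical Python model rather than FlytePropeller code: `TaskEvent`, `EventBuffer`, the phase strings, and the messages are illustrative assumptions. Each event still carries a single reason, and the buffer sends several events in timestamp order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, List


@dataclass(frozen=True)
class TaskEvent:
    """Hypothetical event carrying exactly one reason."""
    occurred_at: datetime
    phase: str
    reason: str


@dataclass
class EventBuffer:
    """Collects events seen during one evaluation pass and flushes them in time order."""
    send: Callable[[TaskEvent], None]
    pending: List[TaskEvent] = field(default_factory=list)

    def record(self, event: TaskEvent) -> None:
        self.pending.append(event)

    def flush(self) -> int:
        # Emit oldest first so the receiver can append reasons chronologically.
        ordered = sorted(self.pending, key=lambda e: e.occurred_at)
        for event in ordered:
            self.send(event)
        self.pending.clear()
        return len(ordered)


# Made-up example: two stages are discovered in one pass, out of order.
sent: List[TaskEvent] = []
buffer = EventBuffer(send=sent.append)
t0 = datetime(2023, 3, 8, 12, 0, 0, tzinfo=timezone.utc)
buffer.record(TaskEvent(t0 + timedelta(seconds=5), "RUNNING", "container started"))
buffer.record(TaskEvent(t0, "INITIALIZING", "pulling image"))
flushed = buffer.flush()
```

This keeps the event schema unchanged, at the cost of more events per pass.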
@hamersaw
Contributor Author

similar to #3357
