Send k8s pod events in TaskExecutionEvent updates #3825

Closed
2 tasks done
andrewwdye opened this issue Jul 3, 2023 · 1 comment
Labels
enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers

Comments

@andrewwdye (Contributor)

Motivation: Why do you think this is important?

It can be difficult to understand delays in task startup. We recently added runtime metrics to the timeline view to better surface where time is spent, and PodCondition reasons are included in the task state tooltip to explain state transitions.

[Screenshots (Jul 3, 2023): execution timeline view with runtime metrics, and the task state tooltip showing PodCondition reasons]

The full text of this tooltip is of the form:

7/3/2023 6:41:18 PM UTC task submitted to K8s

7/3/2023 6:41:18 PM UTC Unschedulable:0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.

7/3/2023 6:42:20 PM UTC [ContainersNotReady|ContainerCreating]: containers with unready status: [aqrp2plhk79cj5dwzg5z-n0-0]|

However, this doesn't indicate ongoing node allocation or image pull, two of the most common delays in "happy path" task startup. By comparison, kubectl get events has much richer information:

❯ kubectl get events -n flytesnacks-development --sort-by='{.metadata.creationTimestamp}' --field-selector involvedObject.name=aqrp2plhk79cj5dwzg5z-n0-0
LAST SEEN   TYPE      REASON                 OBJECT                          MESSAGE
6m57s       Warning   FailedScheduling       pod/aqrp2plhk79cj5dwzg5z-n0-0   0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
6m50s       Normal    TriggeredScaleUp       pod/aqrp2plhk79cj5dwzg5z-n0-0   pod triggered scale-up: [{eks-opta-oc-production-nodegroup1-d7fdbb758a882b40-dec46029-5e1e-5bf0-4999-238661b4dc51 0->1 (max: 5)}]
5m55s       Normal    Scheduled              pod/aqrp2plhk79cj5dwzg5z-n0-0   Successfully assigned flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0 to ip-10-0-148-239.us-east-2.compute.internal
5m53s       Normal    Pulling                pod/aqrp2plhk79cj5dwzg5z-n0-0   Pulling image "cr.flyte.org/flyteorg/flytekit:py3.9-latest"
5m52s       Normal    TaintManagerEviction   pod/aqrp2plhk79cj5dwzg5z-n0-0   Cancelling deletion of Pod flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0
5m23s       Normal    Pulled                 pod/aqrp2plhk79cj5dwzg5z-n0-0   Successfully pulled image "cr.flyte.org/flyteorg/flytekit:py3.9-latest" in 30.158785098s
5m23s       Normal    Created                pod/aqrp2plhk79cj5dwzg5z-n0-0   Created container aqrp2plhk79cj5dwzg5z-n0-0
5m23s       Normal    Started                pod/aqrp2plhk79cj5dwzg5z-n0-0   Started container aqrp2plhk79cj5dwzg5z-n0-0

Goal: What should the final outcome look like, ideally?

The execution closure should include task-specific event details, including scheduling attempts, node allocations, and image pulls.

Describe alternatives you've considered

A more complete solution may overhaul event information in the execution closure so that reasons are not coupled to Flyte state transitions and could instead surface a sink of structured or unstructured event information. This is beyond the scope of this particular issue, but the proposal below does not preclude such an investment in the future.

Propose: Link/Inline OR Additional context

As a potential solution, update DemystifyPending to interleave k8s pod events alongside existing PodCondition reasons.

Note the reporting interface assumes a single event per state; however, a recent change made it possible to report multiple events using a phase version.
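For illustration only, here is a minimal sketch of what interleaving could look like, assuming a plain client-go lookup of events for the task's pod; the `listPodEventReasons` helper, its package, and its signature are assumptions, not the actual DemystifyPending implementation:

```go
// Illustrative sketch: fetch k8s events for a pod so they can be appended
// alongside the existing PodCondition-based reasons.
package podevents

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listPodEventReasons(ctx context.Context, client kubernetes.Interface, namespace, podName string) ([]string, error) {
	events, err := client.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
		// Same selector as the kubectl example above.
		FieldSelector: fmt.Sprintf("involvedObject.name=%s", podName),
	})
	if err != nil {
		return nil, err
	}
	reasons := make([]string, 0, len(events.Items))
	for _, e := range events.Items {
		// e.g. `Pulling: Pulling image "cr.flyte.org/flyteorg/flytekit:py3.9-latest"`
		reasons = append(reasons, fmt.Sprintf("%s: %s", e.Reason, e.Message))
	}
	return reasons, nil
}
```

A one-shot List like this is the simplest form; the watch-based alternative below avoids re-listing on every state transition.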

A relatively naive solution proposed by @hamersaw might be the following (a rough sketch is included after the list):

  • Have propeller keep a watch on k8s events. I assume the kube-client has this functionality. Store these in a local cache (with configurable size), keyed on the object or Flyte task they are associated with.
  • When sending a TaskExecutionEvent, we could look up the k8s events and, instead of a singular reason, return a list of reasons (probably updating the name) containing all unreported k8s events (using some kind of lastSeen indicator - timestamp, resourceVersion, hash of message, etc.).
  • Merge the k8s events into the ExecutionClosure.
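A rough sketch of the watch-and-cache idea above, using a client-go shared informer; the `EventCache` type, `Unreported` method, and the lastSeen-timestamp filter are hypothetical names chosen for illustration:

```go
// Illustrative sketch: a local cache of pod events fed by a shared informer,
// keyed on the involved object name, with a lastSeen filter for events that
// have not yet been reported in a TaskExecutionEvent.
package podevents

import (
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

type EventCache struct {
	mu     sync.Mutex
	byName map[string][]corev1.Event
}

func NewEventCache() *EventCache {
	return &EventCache{byName: map[string][]corev1.Event{}}
}

// Watch starts an informer on Events in the given namespace and records
// everything it sees in the cache.
func (c *EventCache) Watch(client kubernetes.Interface, namespace string, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace(namespace))
	informer := factory.Core().V1().Events().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if ev, ok := obj.(*corev1.Event); ok {
				c.mu.Lock()
				c.byName[ev.InvolvedObject.Name] = append(c.byName[ev.InvolvedObject.Name], *ev)
				c.mu.Unlock()
			}
		},
	})
	factory.Start(stop)
}

// Unreported returns events for the object newer than lastSeen, so an update
// can carry only what hasn't been sent yet.
func (c *EventCache) Unreported(name string, lastSeen time.Time) []corev1.Event {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []corev1.Event
	for _, ev := range c.byName[name] {
		if ev.LastTimestamp.Time.After(lastSeen) {
			out = append(out, ev)
		}
	}
	return out
}
```

A real implementation would also bound the cache size (as suggested in the first bullet) and likely handle event updates, since k8s increments the count on existing Event objects rather than always creating new ones.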

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@andrewwdye andrewwdye added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Jul 3, 2023
@andrewwdye andrewwdye changed the title Include k8s pod events in DemystifyPending Send k8s pod events in TaskExecutionEvent updates Sep 22, 2023
@hamersaw (Contributor)

hamersaw commented Nov 8, 2023

Closing as completed.

@hamersaw hamersaw closed this as completed Nov 8, 2023