Send k8s pod events in TaskExecutionEvent updates #3825

Closed
2 tasks done
andrewwdye opened this issue Jul 3, 2023 · 1 comment
Labels
enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers

Comments

@andrewwdye (Contributor)

Motivation: Why do you think this is important?

It can be difficult to understand delays in task startup. We recently added runtime metrics to the timeline view to better surface where time is spent, and PodCondition reasons are included in the task state tooltip to explain state transitions.

[Screenshots (Jul 3, 2023): execution timeline view with runtime metrics, and the task state tooltip showing PodCondition reasons]

The full text of this tooltip is of the form:

7/3/2023 6:41:18 PM UTC task submitted to K8s

7/3/2023 6:41:18 PM UTC Unschedulable:0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.

7/3/2023 6:42:20 PM UTC [ContainersNotReady|ContainerCreating]: containers with unready status: [aqrp2plhk79cj5dwzg5z-n0-0]|

However, this doesn't indicate ongoing node allocation or image pull, two of the most common delays in "happy path" task startup. By comparison, kubectl get events has much richer information:

❯ kubectl get events -n flytesnacks-development --sort-by='{.metadata.creationTimestamp}' --field-selector involvedObject.name=aqrp2plhk79cj5dwzg5z-n0-0
LAST SEEN   TYPE      REASON                 OBJECT                          MESSAGE
6m57s       Warning   FailedScheduling       pod/aqrp2plhk79cj5dwzg5z-n0-0   0/5 nodes are available: 1 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
6m50s       Normal    TriggeredScaleUp       pod/aqrp2plhk79cj5dwzg5z-n0-0   pod triggered scale-up: [{eks-opta-oc-production-nodegroup1-d7fdbb758a882b40-dec46029-5e1e-5bf0-4999-238661b4dc51 0->1 (max: 5)}]
5m55s       Normal    Scheduled              pod/aqrp2plhk79cj5dwzg5z-n0-0   Successfully assigned flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0 to ip-10-0-148-239.us-east-2.compute.internal
5m53s       Normal    Pulling                pod/aqrp2plhk79cj5dwzg5z-n0-0   Pulling image "cr.flyte.org/flyteorg/flytekit:py3.9-latest"
5m52s       Normal    TaintManagerEviction   pod/aqrp2plhk79cj5dwzg5z-n0-0   Cancelling deletion of Pod flytesnacks-development/aqrp2plhk79cj5dwzg5z-n0-0
5m23s       Normal    Pulled                 pod/aqrp2plhk79cj5dwzg5z-n0-0   Successfully pulled image "cr.flyte.org/flyteorg/flytekit:py3.9-latest" in 30.158785098s
5m23s       Normal    Created                pod/aqrp2plhk79cj5dwzg5z-n0-0   Created container aqrp2plhk79cj5dwzg5z-n0-0
5m23s       Normal    Started                pod/aqrp2plhk79cj5dwzg5z-n0-0   Started container aqrp2plhk79cj5dwzg5z-n0-0

Goal: What should the final outcome look like, ideally?

The execution closure should include task-specific event details, including scheduling attempts, node allocations, and image pulls.

Describe alternatives you've considered

A more complete solution may overhaul event information in the execution closure so that reasons are not coupled to Flyte state transitions and could instead surface a sink of structured or unstructured event information. This is beyond the scope of this particular issue, but the proposal below does not preclude such an investment in the future.

Propose: Link/Inline OR Additional context

As a potential solution, update DemystifyPending to interleave k8s pod events alongside existing PodCondition reasons.

Note the reporting interface assumes a single event per state; however, a recent change made it possible to report multiple events using a phase version.
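For illustration only, here is a minimal sketch of what interleaving could look like, assuming a plain client-go lookup of events for the task's pod; the `listPodEventReasons` helper, its package, and its signature are assumptions, not the actual DemystifyPending implementation:

```go
// Illustrative sketch: fetch k8s events for a pod so they can be appended
// alongside the existing PodCondition-based reasons.
package podevents

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listPodEventReasons(ctx context.Context, client kubernetes.Interface, namespace, podName string) ([]string, error) {
	events, err := client.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
		// Same selector as the kubectl example above.
		FieldSelector: fmt.Sprintf("involvedObject.name=%s", podName),
	})
	if err != nil {
		return nil, err
	}
	reasons := make([]string, 0, len(events.Items))
	for _, e := range events.Items {
		// e.g. `Pulling: Pulling image "cr.flyte.org/flyteorg/flytekit:py3.9-latest"`
		reasons = append(reasons, fmt.Sprintf("%s: %s", e.Reason, e.Message))
	}
	return reasons, nil
}
```

A one-shot List like this is the simplest form; the watch-based alternative below avoids re-listing on every state transition.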

A relatively naive solution proposed by @hamersaw might be the following (a rough sketch is included after the list):

  • Have propeller keep a watch on k8s events. I assume the kube-client has this functionality. Store these in a local cache (with configurable size), keyed on the object or Flyte task they are associated with.
  • When sending a TaskExecutionEvent, we could look up the k8s events and, instead of a singular reason, return a list of reasons (probably updating the name) containing all unreported k8s events (using some kind of lastSeen indicator - timestamp, resourceVersion, hash of message, etc.).
  • Merge the k8s events into the ExecutionClosure.
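A rough sketch of the watch-and-cache idea above, using a client-go shared informer; the `EventCache` type, `Unreported` method, and the lastSeen-timestamp filter are hypothetical names chosen for illustration:

```go
// Illustrative sketch: a local cache of pod events fed by a shared informer,
// keyed on the involved object name, with a lastSeen filter for events that
// have not yet been reported in a TaskExecutionEvent.
package podevents

import (
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

type EventCache struct {
	mu     sync.Mutex
	byName map[string][]corev1.Event
}

func NewEventCache() *EventCache {
	return &EventCache{byName: map[string][]corev1.Event{}}
}

// Watch starts an informer on Events in the given namespace and records
// everything it sees in the cache.
func (c *EventCache) Watch(client kubernetes.Interface, namespace string, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace(namespace))
	informer := factory.Core().V1().Events().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if ev, ok := obj.(*corev1.Event); ok {
				c.mu.Lock()
				c.byName[ev.InvolvedObject.Name] = append(c.byName[ev.InvolvedObject.Name], *ev)
				c.mu.Unlock()
			}
		},
	})
	factory.Start(stop)
}

// Unreported returns events for the object newer than lastSeen, so an update
// can carry only what hasn't been sent yet.
func (c *EventCache) Unreported(name string, lastSeen time.Time) []corev1.Event {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []corev1.Event
	for _, ev := range c.byName[name] {
		if ev.LastTimestamp.Time.After(lastSeen) {
			out = append(out, ev)
		}
	}
	return out
}
```

A real implementation would also bound the cache size (as suggested in the first bullet) and likely handle event updates, since k8s increments the count on existing Event objects rather than always creating new ones.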

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@andrewwdye andrewwdye added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Jul 3, 2023
@andrewwdye andrewwdye changed the title Include k8s pod events in DemystifyPending Send k8s pod events in TaskExecutionEvent updates Sep 22, 2023
@hamersaw (Contributor)

hamersaw commented Nov 8, 2023

Closing as completed.

@hamersaw hamersaw closed this as completed Nov 8, 2023