[YUNIKORN-657] Expose application failure reasons #258

ghost · 2021-05-05T22:38:57Z

What is this PR for?

When an application fails or is rejected by the scheduler, the pods currently stay in pending forever without visibility on what happened to them. This PR propagates the scheduling error or failure to the pods so that the user will be able to see the reason of failure. If third-party operators (e.g. Spark K8s Operator) are used, they can also react according to the pod status change and update the status of the corresponding CRD and possibly retry. Core side changes are in apache/yunikorn-core#271

What type of PR is it?

Todos

- Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-657

How should this be tested?

Added unit tests to cover this change. When running in a cluster, tested that scheduling failures will be reflected on the pods:

For example, if an application is rejected, its pods will be marked as failed and its status field will be populated with the following information:

status:
  message: ' application ''spark-<ID>'' rejected, cannot
    create queue ''root.spark'' without placement rules'
  phase: Failed
  reason: ApplicationRejected

Another example is that if a gang-scheduled application cannot successfully run all placeholders before they expire, the app pods' status field will show:

status:
  message: Scheduling has timed out due to insufficient resources
  phase: Failed
  reason: InsufficientResources

Screenshots (if appropriate)

Questions:

- The licenses files need update.
- There is breaking changes for older versions.
- It needs documentation.

codecov · 2021-05-06T02:14:32Z

Codecov Report

Merging #258 (5b5c36d) into master (c47ed51) will increase coverage by 0.90%.
The diff coverage is 65.15%.

@@            Coverage Diff             @@
##           master     #258      +/-   ##
==========================================
+ Coverage   59.75%   60.65%   +0.90%     
==========================================
  Files          35       36       +1     
  Lines        3133     3342     +209     
==========================================
+ Hits         1872     2027     +155     
- Misses       1180     1232      +52     
- Partials       81       83       +2

Impacted Files	Coverage Δ
pkg/appmgmt/appmgmt_recovery.go	`67.50% <0.00%> (-8.18%)`	⬇️
pkg/cache/amprotocol_mock.go	`0.00% <0.00%> (ø)`
pkg/cache/node.go	`84.55% <ø> (ø)`
pkg/cache/nodes.go	`79.80% <ø> (ø)`
pkg/common/events/recorder_mock.go	`0.00% <0.00%> (ø)`
pkg/common/resource.go	`90.72% <0.00%> (-9.28%)`	⬇️
pkg/common/utils/gang_utils.go	`65.43% <0.00%> (-16.11%)`	⬇️
pkg/controller/application/app_controller.go	`71.05% <ø> (-0.26%)`	⬇️
...missioncontrollers/webhook/admission_controller.go	`33.74% <0.00%> (+1.00%)`	⬆️
pkg/shim/main.go	`0.00% <ø> (ø)`
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54c9b31...5b5c36d. Read the comment docs.

yangwwei · 2021-05-08T21:58:12Z

pkg/client/kubeclient.go

+func (nc SchedulerKubeClient) Update(pod *v1.Pod) (*v1.Pod, error) {
+	var updatedPod *v1.Pod
+	var err error
+	if updatedPod, err = nc.clientSet.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
+		log.Logger().Warn("failed to update pod",
+			zap.String("namespace", pod.Namespace),
+			zap.String("podName", pod.Name),
+			zap.Error(err))
+		return pod, err
+	}
+	return updatedPod, nil
+}


The function is called "Update" but the actual impl is "UpdateStatus".
I think we should either re-implement this with "nc.clientSet.CoreV1().Pods(pod.Namespace).Update()" or change this function name to "UpdateStatus" to avoid mis-usage.

Good catch. Changed to UpdateStatus

yangwwei · 2021-05-08T22:01:46Z

pkg/common/utils/utils.go

+	if pod.Status.Phase == v1.PodFailed && pod.Status.Reason == constants.ApplicationRejectedFailure || pod.Status.Reason == constants.ApplicationInsufficientResourcesFailure {
+		return false, nil
+	}


Can we simply return false here when pod phase is Failed?
Do we have to check the reason?

You are right. Removed reason check

yangwwei · 2021-05-08T22:13:14Z

pkg/shim/scheduler.go

+	var err error
+	if len(se.GetArgs()) >= 1 {
+		eventArgs := make([]string, 2)
+		if err = events.GetEventArgsAsStrings(eventArgs, se.GetArgs()); err != nil {
+			log.Logger().Error("fail to parse event arg", zap.Error(err))
+		}
+		message := eventArgs[1]
+		err = ss.stateMachine.Event(string(se.GetEvent()), message)
+	} else {
+		err = ss.stateMachine.Event(string(se.GetEvent()))
+	}
 	if err != nil && err.Error() == "no transition" {


Can we use a more general form? The interface looks like

type SchedulerEvent interface { GetEvent() SchedulerEventType GetArgs() []interface{} }

Can we leverage the args here:

err = ss.stateMachine.Event(string(se.GetEvent()), se.GetArgs())

Good point. Updated accordingly. Now it's more generic and cleaner 👍

yangwwei · 2021-05-08T22:14:20Z

Overall looks good, thanks for improving this! Please see some minor comments, thank you @yuchaoran-apple .

yangwwei

LGTM

When the application transits to failed state, publish pod events to all pods of this application to indicate why the application was failed.

Shim side changes for YUNIKORN-657

23891c4

ghost mentioned this pull request May 5, 2021

[YUNIKORN-657] Expose application failure reasons apache/yunikorn-core#271

Merged

10 tasks

yangwwei assigned ghost May 8, 2021

yangwwei requested changes May 8, 2021

View reviewed changes

Addressed review comments

5b5c36d

yangwwei changed the title ~~Shim side changes for YUNIKORN-657~~ [YUNIKORN-657] Expose application failure reasons May 10, 2021

yangwwei approved these changes May 10, 2021

View reviewed changes

yangwwei merged commit 8034184 into apache:master May 10, 2021

ghost deleted the YUNIKORN-657 branch May 10, 2021 15:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[YUNIKORN-657] Expose application failure reasons #258

[YUNIKORN-657] Expose application failure reasons #258

ghost commented May 5, 2021 •

edited by ghost

codecov bot commented May 6, 2021 •

edited

yangwwei May 8, 2021

ghost May 9, 2021

yangwwei May 8, 2021

ghost May 9, 2021

yangwwei May 8, 2021

ghost May 9, 2021

yangwwei commented May 8, 2021

yangwwei left a comment

[YUNIKORN-657] Expose application failure reasons #258

[YUNIKORN-657] Expose application failure reasons #258

Conversation

ghost commented May 5, 2021 • edited by ghost

What is this PR for?

What type of PR is it?

Todos

What is the Jira issue?

How should this be tested?

Screenshots (if appropriate)

Questions:

codecov bot commented May 6, 2021 • edited

Codecov Report

yangwwei May 8, 2021

Choose a reason for hiding this comment

ghost May 9, 2021

Choose a reason for hiding this comment

yangwwei May 8, 2021

Choose a reason for hiding this comment

ghost May 9, 2021

Choose a reason for hiding this comment

yangwwei May 8, 2021

Choose a reason for hiding this comment

ghost May 9, 2021

Choose a reason for hiding this comment

yangwwei commented May 8, 2021

yangwwei left a comment

Choose a reason for hiding this comment

ghost commented May 5, 2021 •

edited by ghost

codecov bot commented May 6, 2021 •

edited